System and method for cleaning noisy genetic data from target individuals using genetic data from genetically related individuals

ABSTRACT

A system and method for determining the genetic data for one or a small set of cells, or from fragmentary DNA, where a limited quantity of genetic data is available, are disclosed. Genetic data for the target individual is acquired and amplified using known methods, and poorly measured base pairs, missing alleles and missing regions are reconstructed using expected similarities between the target genome and the genome of genetically related subjects. In accordance with one embodiment of the invention, incomplete genetic data is acquired from embryonic cells, fetal cells, or cell-free fetal DNA isolated from the mother&#39;s blood, and the incomplete genetic data is reconstructed using the more complete genetic data from a larger sample diploid cells from one or both parents, with or without genetic data from haploid cells from one or both parents, and/or genetic data taken from other related individuals.

CROSS-REFERENCES TO RELATED APPLICATIONS

This application is a continuation of U.S. Utility application Ser. No.16/399,931, filed Apr. 30, 2019. U.S. Utility application Ser. No.16/399,931 is a continuation of U.S. Utility application Ser. No.16/288,690, filed Feb. 28, 2019. U.S. Utility application Ser. No.16/288,690 is a continuation of U.S. Utility application Ser. No.15/187,555, filed Jun. 20, 2016 (now U.S. Pat. No. 10,227,652). U.S.Utility application Ser. No. 15/187,555 (now U.S. Pat. No. 10,227,652)is a continuation of U.S. Utility application Ser. No. 14/092,457, filedNov. 27, 2013 (now U.S. Pat. No. 9,430,611). U.S. Utility applicationSer. No. 14/092,457 (now U.S. Pat. No. 9,430,611) is a continuation ofU.S. Utility application Ser. No. 13/793,133, filed Mar. 11, 2013 (nowU.S. Pat. No. 9,424,392), and U.S. Utility application Ser. No.13/793,186, filed Mar. 11, 2013 (now U.S. Pat. No. 8,682,592). U.S.Utility application Ser. No. 13/793,133, now U.S. Pat. No. 9,424,392, isa continuation of U.S. Utility application Ser. No. 11/603,406, filedNov. 22, 2006, now U.S. Pat. No. 8,532,930. U.S. Utility applicationSer. No. 13/793,186, now U.S. Pat. No. 8,682,592, is a continuation ofU.S. Utility application Ser. No. 11/603,406, filed Nov. 22, 2006, nowU.S. Pat. No. 8,532,930. U.S. Utility application Ser. No. 11/603,406,now U.S. Pat. No. 8,532,930, claims the benefit under 35 U.S.C. § 119(e)of the following U.S. Provisional Patent Applications: Ser. No.60/739,882, filed Nov. 26, 2005; Ser. No. 60/742,305, filed Dec. 6,2005; Ser. No. 60/754,396, filed Dec. 29, 2005; Ser. No. 60/774,976,filed Feb. 21, 2006; Ser. No. 60/789,506, filed Apr. 4, 2006; and Ser.No. 60/817,741, filed Jun. 30, 2006; and Ser. No. 60/846,610, filed Sep.22, 2006; the disclosures of the patent applications cited in thisparagraph are each incorporated by reference herein in their entirety.

BACKGROUND OF THE INVENTION Field of the Invention

The invention relates generally to the field of acquiring, manipulatingand using genetic data for medically predictive purposes, andspecifically to a system in which imperfectly measured genetic data ismade more precise by using known genetic data of genetically relatedindividuals, thereby allowing more effective identification of geneticirregularities that could result in various phenotypic outcomes.

Description of the Related Art

Current methods of prenatal diagnosis can alert physicians and parentsto abnormalities in growing fetuses. Without prenatal diagnosis, one in50 babies is born with serious physical or mental handicap, and as manyas one in 30 will have some form of congenital malformation.Unfortunately, standard methods require invasive testing and carry aroughly 1 percent risk of miscarriage. These methods includeamniocentesis, chorion villus biopsy and fetal blood sampling. Of these,amniocentesis is the most common procedure; in 2003, it was performed inapproximately 3% of all pregnancies, though its frequency of use hasbeen decreasing over the past decade and a half. A major drawback ofprenatal diagnosis is that given the limited courses of action once anabnormality has been detected, it is only valuable and ethical to testfor very serious defects. As result, prenatal diagnosis is typicallyonly attempted in cases of high-risk pregnancies, where the elevatedchance of a defect combined with the seriousness of the potentialabnormality outweighs the risks. A need exists for a method of prenataldiagnosis that mitigates these risks.

It has recently been discovered that cell-free fetal DNA and intactfetal cells can enter maternal blood circulation. Consequently, analysisof these cells can allow early Non-Invasive Prenatal Genetic Diagnosis(NIPGD). A key challenge in using NIPGD is the task of identifying andextracting fetal cells or nucleic acids from the mother's blood. Thefetal cell concentration in maternal blood depends on the stage ofpregnancy and the condition of the fetus, but estimates range from oneto forty fetal cells in every milliliter of maternal blood, or less thanone fetal cell per 100,000 maternal nucleated cells. Current techniquesare able to isolate small quantities of fetal cells from the mother'sblood, although it is very difficult to enrich the fetal cells to purityin any quantity. The most effective technique in this context involvesthe use of monoclonal antibodies, but other techniques used to isolatefetal cells include density centrifugation, selective lysis of adulterythrocytes, and FACS. Fetal DNA isolation has been demonstrated usingPCR amplification using primers with fetal-specific DNA sequences. Sinceonly tens of molecules of each embryonic SNP are available through thesetechniques, the genotyping of the fetal tissue with high fidelity is notcurrently possible.

Much research has been done towards the use of pre-implantation geneticdiagnosis (PGD) as an alternative to classical prenatal diagnosis ofinherited disease. Most PGD today focuses on high-level chromosomalabnormalities such as aneuploidy and balanced translocations with theprimary outcomes being successful implantation and a take-home baby. Aneed exists for a method for more extensive genotyping of embryos at thepre-implantation stage. The number of known disease associated geneticalleles is currently at 389 according to OMIM and steadily climbing.Consequently, it is becoming increasingly relevant to analyze multipleembryonic SNPs that are associated with disease phenotypes. A clearadvantage of pre-implantation genetic diagnosis over prenatal diagnosisis that it avoids some of the ethical issues regarding possible choicesof action once undesirable phenotypes have been detected.

Many techniques exist for isolating single cells. The FACS machine has avariety of applications; one important application is to discriminatebetween cells based on size, shape and overall DNA content. The FACSmachine can be set to sort single cells into any desired container. Manydifferent groups have used single cell DNA analysis for a number ofapplications, including prenatal genetic diagnosis, recombinationstudies, and analysis of chromosomal imbalances. Single-sperm genotypinghas been used previously for forensic analysis of sperm samples (todecrease problems arising from mixed samples) and for single-cellrecombination studies.

Isolation of single cells from human embryos, while highly technical, isnow routine in in vitro fertilization clinics. To date, the vastmajority of prenatal diagnoses have used fluorescent in situhybridization (FISH), which can determine large chromosomal aberrations(such as Down syndrome, or trisomy 21) and PCR/electrophoresis, whichcan determine a handful of SNPs or other allele calls. Both polar bodiesand blastomeres have been isolated with success. It is critical toisolate single blastomeres without compromising embryonic integrity. Themost common technique is to remove single blastomeres from day 3 embryos(6 or 8 cell stage). Embryos are transferred to a special cell culturemedium (standard culture medium lacking calcium and magnesium), and ahole is introduced into the zona pellucida using an acidic solution,laser, or mechanical drilling. The technician then uses a biopsy pipetteto remove a single visible nucleus. Clinical studies have demonstratedthat this process does not decrease implantation success, since at thisstage embryonic cells are undifferentiated.

There are three major methods available for whole genome amplification(WGA): ligation-mediated PCR (LM-PCR), degenerate oligonucleotide primerPCR (DOP-PCR), and multiple displacement amplification (MDA). In LM-PCR,short DNA sequences called adapters are ligated to blunt ends of DNA.These adapters contain universal amplification sequences, which are usedto amplify the DNA by PCR. In DOP-PCR, random primers that also containuniversal amplification sequences are used in a first round of annealingand PCR. Then, a second round of PCR is used to amplify the sequencesfurther with the universal primer sequences. Finally, MDA uses thephi-29 polymerase, which is a highly processive and non-specific enzymethat replicates DNA and has been used for single-cell analysis. Of thethree methods, DOP-PCR reliably produces large quantities of DNA fromsmall quantities of DNA, including single copies of chromosomes. On theother hand, MDA is the fastest method, producing hundred-foldamplification of DNA in a few hours. The major limitations to amplifyingmaterial from a single cell are (1) necessity of using extremely diluteDNA concentrations or extremely small volume of reaction mixture, and(2) difficulty of reliably dissociating DNA from proteins across thewhole genome. Regardless, single-cell whole genome amplification hasbeen used successfully for a variety of applications for a number ofyears.

There are numerous difficulties in using DNA amplification in thesecontexts. Amplification of single-cell DNA (or DNA from a small numberof cells, or from smaller amounts of DNA) by PCR can fail completely, asreported in 5-10% of the cases. This is often due to contamination ofthe DNA, the loss of the cell, its DNA, or accessibility of the DNAduring the PCR reaction. Other sources of error that may arise inmeasuring the embryonic DNA by amplification and microarray analysisinclude transcription errors introduced by the DNA polymerase where aparticular nucleotide is incorrectly copied during PCR, and microarrayreading errors due to imperfect hybridization on the array. The biggestproblem, however, remains allele drop-out (ADO) defined as the failureto amplify one of the two alleles in a heterozygous cell. ADO can affectup to more than 40% of amplifications and has already caused PGDmisdiagnoses. ADO becomes a health issue especially in the case of adominant disease, where the failure to amplify can lead to implantationof an affected embryo. The need for more than one set of primers pereach marker (in heterozygotes) complicate the PCR process. Therefore,more reliable PCR assays are being developed based on understanding theADO origin. Reaction conditions for single-cell amplifications are understudy. The amplicon size, the amount of DNA degradation, freezing andthawing, and the PCR program and conditions can each influence the rateof ADO.

All those techniques, however, depend on the minute DNA amount availablefor amplification in the single cell. This process is often accompaniedby contamination. Proper sterile conditions and microsatellite sizingcan exclude the chance of contaminant DNA as microsatellite analysisdetected only in parental alleles exclude contamination. Studies toreliably transfer molecular diagnostic protocols to the single-celllevel have been recently pursued using first-round multiplex PCR ofmicrosatellite markers, followed by real-time PCR and microsatellitesizing to exclude chance contamination. Multiplex PCR allows for theamplification of multiple fragments in a single reaction, a crucialrequirement in the single-cell DNA analysis. Although conventional PCRwas the first method used in PGD, fluorescence in situ hybridization(FISH) is now common. It is a delicate visual assay that allows thedetection of nucleic acid within undisturbed cellular and tissuearchitecture. It relies firstly on the fixation of the cells to beanalyzed. Consequently, optimization of the fixation and storagecondition of the sample is needed, especially for single-cellsuspensions.

Advanced technologies that enable the diagnosis of a number of diseasesat the single-cell level include interphase chromosome conversion,comparative genomic hybridization (CGH), fluorescent PCR, and wholegenome amplification. The reliability of the data generated by all ofthese techniques rely on the quality of the DNA preparation. PGD is alsocostly, consequently there is a need for less expensive approaches, suchas mini-sequencing. Unlike most mutation-detection techniques,mini-sequencing permits analysis of very small DNA fragments with lowADO rate. Better methods for the preparation of single-cell DNA foramplification and PGD are therefore needed and are under study. The morenovel microarrays and comparative genomic hybridization techniques,still ultimately rely on the quality of the DNA under analysis.

Several techniques are in development to measure multiple SNPs on theDNA of a small number of cells, a single cell (for example, ablastomere), a small number of chromosomes, or from fragments of DNA.There are techniques that use Polymerase Chain Reaction (PCR), followedby microarray genotyping analysis. Some PCR-based techniques includewhole genome amplification (WGA) techniques such as multipledisplacement amplification (MDA), and MOLECULAR INVERSION PROBES (MIPs)that perform genotyping using multiple tagged oligonucleotides that maythen be amplified using PCR with a single pair of primers. An example ofa non-PCR based technique is fluorescence in situ hybridization (FISH).It is apparent that the techniques will be severely error-prone due tothe limited amount of genetic material which will exacerbate the impactof effects such as allele drop-outs, imperfect hybridization, andcontamination.

Many techniques exist which provide genotyping data. TAQMAN is a uniquegenotyping technology produced and distributed by Applied Biosystems.TAQMAN uses polymerase chain reaction (PCR) to amplify sequences ofinterest. During PCR cycling, an allele specific minor groove binder(MGB) probe hybridizes to amplified sequences. Strand synthesis by thepolymerase enzymes releases reporter dyes linked to the MGB probes, andthen the TAQMAN optical readers detect the dyes. In this manner, TAQMANachieves quantitative allelic discrimination. Compared with array basedgenotyping technologies, TAQMAN is quite expensive per reaction(˜$0.40/reaction), and throughput is relatively low (384 genotypes perrun). While only Ing of DNA per reaction is necessary, thousands ofgenotypes by TAQMAN requires microgram quantities of DNA, so TAQMAN doesnot necessarily use less DNA than microarrays. However, with respect tothe IVF genotyping workflow, TAQMAN is the most readily applicabletechnology. This is due to the high reliability of the assays and, mostimportantly, the speed and ease of the assay (˜3 hours per run andminimal molecular biological steps). Also unlike many array technologies(such as 500k AFFMETRIX arrays), TAQMAN is highly customizable, which isimportant for the IVF market. Further, TAQMAN is highly quantitative, soaneuploidies could be detected with this technology alone.

ILLUMINA has recently emerged as a leader in high-throughput genotyping.Unlike AFFMETRIX, ILLUMINA genotyping arrays do not rely exclusively onhybridization. Instead, ILLUMINA technology uses an allele-specific DNAextension step, which is much more sensitive and specific thanhybridization alone, for the original sequence detection. Then, all ofthese alleles are amplified in multiplex by PCR, and then these productshybridized to bead arrays. The beads on these arrays contain unique“address” tags, not native sequence, so this hybridization is highlyspecific and sensitive. Alleles are then called by quantitative scanningof the bead arrays. The Illlumina GOLDEN GATE assay system genotypes upto 1536 loci concurrently, so the throughput is better than TAQMAN butnot as high as AFFMETRIX 500k arrays. The cost of ILLUMINA genotypes islower than TAQMAN, but higher than AFFMETRIX arrays. Also, the ILLUMINAplatform takes as long to complete as the 500k AFFMETRIX arrays (up to72 hours), which is problematic for IVF genotyping. However, ILLUMINAhas a much better call rate, and the assay is quantitative, soaneuploidies are detectable with this technology. ILLUMINA technology ismuch more flexible in choice of SNPs than 500k AFFMETRIX arrays.

One of the highest throughput techniques, which allows for themeasurement of up to 250,000 SNPs at a time, is the AFFMETRIX GeneChip500K genotyping array. This technique also uses PCR, followed byanalysis by hybridization and detection of the amplified DNA sequencesto DNA probes, chemically synthesized at different locations on a quartzsurface. A disadvantage of these arrays are the low flexibility and thelower sensitivity. There are modified approaches that can increaseselectivity, such as the “perfect match” and “mismatch probe”approaches, but these do so at the cost of the number of SNPs calls perarray.

Pyrosequencing, or sequencing by synthesis, can also be used forgenotyping and SNP analysis. The main advantages to pyrosequencinginclude an extremely fast turnaround and unambiguous SNP calls, however,the assay is not currently conducive to high-throughput parallelanalysis. PCR followed by gel electrophoresis is an exceedingly simpletechnique that has met the most success in preimplantation diagnosis. Inthis technique, researchers use nested PCR to amplify short sequences ofinterest. Then, they run these DNA samples on a special gel to visualizethe PCR products. Different bases have different molecular weights, soone can determine base content based on how fast the product runs in thegel. This technique is low-throughput and requires subjective analysesby scientists using current technologies, but has the advantage of speed(1-2 hours of PCR, 1 hour of gel electrophoresis). For this reason, ithas been used previously for prenatal genotyping for a myriad ofdiseases, including: thalassaemia, neurofibromatosis type 2, leukocyteadhesion deficiency type I, Hallopeau-Siemens disease, sickle-cellanemia, retinoblastoma, Pelizaeus-Merzbacher disease, Duchenne musculardystrophy, and Currarino syndrome.

Another promising technique that has been developed for genotyping smallquantities of genetic material with very high fidelity is MOLECULARINVERSION PROBES (MIPs), such as AFFMETRIX's GENFLEX Arrays. Thistechnique has the capability to measure multiple SNPs in parallel: morethan 10,000 SNPS measured in parallel have been verified. For smallquantities of genetic material, call rates for this technique have beenestablished at roughly 95%, and accuracy of the calls made has beenestablished to be above 99%. So far, the technique has been implementedfor quantities of genomic data as small as 150 molecules for a givenSNP. However, the technique has not been verified for genomic data froma single cell, or a single strand of DNA, as would be required forpre-implantation genetic diagnosis.

The MIP technique makes use of padlock probes which are linearoligonucleotides whose two ends can be joined by ligation when theyhybridize to immediately adjacent target sequences of DNA. After theprobes have hybridized to the genomic DNA, a gap-fill enzyme is added tothe assay which can add one of the four nucleotides to the gap. If theadded nucleotide (A,C,T,G) is complementary to the SNP undermeasurement, then it will hybridize to the DNA, and join the ends of thepadlock probe by ligation. The circular products, or closed padlockprobes, are then differentiated from linear probes by exonucleolysis.The exonuclease, by breaking down the linear probes and leaving thecircular probes, will change the relative concentrations of the closedvs. the unclosed probes by a factor of 1000 or more. The probes thatremain are then opened at a cleavage site by another enzyme, removedfrom the DNA, and amplified by PCR. Each probe is tagged with adifferent tag sequence consisting of 20 base tags (16,000 have beengenerated), and can be detected, for example, by the AFFMETRIX GENFLEXTag Array. The presence of the tagged probe from a reaction in which aparticular gap-fill enzyme was added indicates the presence of thecomplimentary amino acid on the relevant SNP.

The molecular biological advantages of MIPS include: (1) multiplexedgenotyping in a single reaction, (2) the genotype “call” occurs by gapfill and ligation, not hybridization, and (3) hybridization to an arrayof universal tags decreases false positives inherent to most arrayhybridizations. In traditional 500K, TAQMAN and other genotyping arrays,the entire genomic sample is hybridized to the array, which contains avariety of perfect match and mismatch probes, and an algorithm callslikely genotypes based on the intensities of the mismatch and perfectmatch probes. Hybridization, however, is inherently noisy, because ofthe complexities of the DNA sample and the huge number of probes on thearrays. MIPs, on the other hand, uses multiplex probes (i.e., not on anarray) that are longer and therefore more specific, and then uses arobust ligation step to circularize the probe. Background is exceedinglylow in this assay (due to specificity), though allele dropout may behigh (due to poor performing probes).

When this technique is used on genomic data from a single cell (or smallnumbers of cells) it will—like PCR based approaches—suffer fromintegrity issues. For example, the inability of the padlock probe tohybridize to the genomic DNA will cause allele dropouts. This will beexacerbated in the context of in-vitro fertilization since theefficiency of the hybridization reaction is low, and it needs to proceedrelatively quickly in order to genotype the embryo in a limited timeperiod. Note that the hybridization reaction can be reduced well belowvendor-recommended levels, and micro-fluidic techniques may also be usedto accelerate the hybridization reaction. These approaches to reducingthe time for the hybridization reaction will result in reduced dataquality.

Once the genetic data has been measured, the next step is to use thedata for predictive purposes. Much research has been done in predictivegenomics, which tries to understand the precise functions of proteins,RNA and DNA so that phenotypic predictions can be made based ongenotype. Canonical techniques focus on the function ofSingle-Nucleotide Polymorphisms (SNP); but more advanced methods arebeing brought to bear on multi-factorial phenotypic features. Thesemethods include techniques, such as linear regression and nonlinearneural networks, which attempt to determine a mathematical relationshipbetween a set of genetic and phenotypic predictors and a set of measuredoutcomes. There is also a set of regression analysis techniques, such asRidge regression, log regression and stepwise selection, that aredesigned to accommodate sparse data sets where there are many potentialpredictors relative to the number of outcomes, as is typical of geneticdata, and which apply additional constraints on the regressionparameters so that a meaningful set of parameters can be resolved evenwhen the data is underdetermined. Other techniques apply principalcomponent analysis to extract information from undetermined data sets.Other techniques, such as decision trees and contingency tables, usestrategies for subdividing subjects based on their independent variablesin order to place subjects in categories or bins for which thephenotypic outcomes are similar. A recent technique, termed logicalregression, describes a method to search for different logicalinterrelationships between categorical independent variables in order tomodel a variable that depends on interactions between multipleindependent variables related to genetic data. Regardless of the methodused, the quality of the prediction is naturally highly dependent on thequality of the genetic data used to make the prediction.

Normal humans have two sets of 23 chromosomes in every diploid cell,with one copy coming from each parent. Aneuploidy, a cell with an extraor missing chromosomes, and uniparental disomy, a cell with two of agiven chromosome that originate from one parent, are believed to beresponsible for a large percentage of failed implantations,miscarriages, and genetic diseases. When only certain cells in anindividual are aneuploid, the individual is said to exhibit mosaicism.Detection of chromosomal abnormalities can identify individuals orembryos with conditions such as Down syndrome, Klinefelter syndrome, andTurner syndrome, among others, in addition to increasing the chances ofa successful pregnancy. Testing for chromosomal abnormalities isespecially important as mothers age: between the ages of 35 and 40 it isestimated that between 40% and 50% of the embryos are abnormal, andabove the age of 40, more than half of the embryos are abnormal.

Karyotyping, the traditional method used for the prediction ofaneuploides and mosaicism is giving way to other more high throughput,more cost effective methods. One method that has attracted muchattention recently is Flow cytometry (FC) and fluorescence in situhybridization (FISH) which can be used to detect aneuploidy in any phaseof the cell cycle. One advantage of this method is that it is lessexpensive than karyotyping, but the cost is significant enough thatgenerally a small selection of chromosomes are tested (usuallychromosomes 13, 18, 21, X, Y; also sometimes 8, 9, 15, 16, 17, 22); inaddition, FISH has a low level of specificity. Using FISH to analyze 15cells, one can detect mosaicism of 19% with 95% confidence. Thereliability of the test becomes much lower as the level of mosaicismgets lower, and as the number of cells to analyze decreases. The test isestimated to have a false negative rate as high as 15% when a singlecell is analyzed. There is a great demand for a method that has a higherthroughput, lower cost, and greater accuracy.

Listed here is a set of prior art which is related to the field of thecurrent invention. None of this prior art contains or in any way refersto the novel elements of the current invention. In U.S. Pat. No.6,720,140, Hartley et al describe a recombinational cloning method formoving or exchanging segments of DNA molecules using engineeredrecombination sites and recombination proteins. In U.S. Pat. No.6,489,135 Parrott et al. provide methods for determining variousbiological characteristics of in vitro fertilized embryos, includingoverall embryo health, implantability, and increased likelihood ofdeveloping successfully to term by analyzing media specimens of in vitrofertilization cultures for levels of bioactive lipids in order todetermine these characteristics. In US Patent Application 20040033596Threadgill et al. describe a method for preparing homozygous cellularlibraries useful for in vitro phenotyping and gene mapping involvingsite-specific mitotic recombination in a plurality of isolated parentcells. In U.S. Pat. No. 5,994,148 Stewart et al. describe a method ofdetermining the probability of an in vitro fertilization (IVF) beingsuccessful by measuring Relaxin directly in the serum or indirectly byculturing granulosa lutein cells extracted from the patient as part ofan IVF/ET procedure. In U.S. Pat. No. 5,635,366 Cooke et al. provide amethod for predicting the outcome of IVF by determining the level of11β-hydroxysteroid dehydrogenase (11β-HSD) in a biological sample from afemale patient. In U.S. Pat. No. 7,058,616 Larder et al. describe amethod for using a neural network to predict the resistance of a diseaseto a therapeutic agent. In U.S. Pat. No. 6,958,211 Vingerhoets et al.describe a method wherein the integrase genotype of a given HIV strainis simply compared to a known database of HIV integrase genotype withassociated phenotypes to find a matching genotype. In U.S. Pat. No.7,058,517 Denton et al. describe a method wherein an individual'shaplotypes are compared to a known database of haplotypes in the generalpopulation to predict clinical response to a treatment. In U.S. Pat. No.7,035,739 Schadt at al. describe a method is described wherein a geneticmarker map is constructed and the individual genes and traits areanalyzed to give a gene-trait locus data, which are then clustered as away to identify genetically interacting pathways, which are validatedusing multivariate analysis. In U.S. Pat. No. 6,025,128 Veltri et al.describe a method involving the use of a neural network utilizing acollection of biomarkers as parameters to evaluate risk of prostatecancer recurrence.

The cost of DNA sequencing is dropping rapidly, and in the near futureindividual genomic sequencing for personal benefit will become morecommon. Knowledge of personal genetic data will allow for extensivephenotypic predictions to be made for the individual. In order to makeaccurate phenotypic predictions high quality genetic data is critical,whatever the context. In the case of prenatal or pre-implantationgenetic diagnoses a complicating factor is the relative paucity ofgenetic material available. Given the inherently noisy nature of themeasured genetic data in cases where limited genetic material is usedfor genotyping, there is a great need for a method which can increasethe fidelity of, or clean, the primary data.

SUMMARY OF THE INVENTION

The system disclosed enables the cleaning of incomplete or noisy geneticdata using secondary genetic data as a source of information. While thedisclosure focuses on genetic data from human subjects, and morespecifically on as-yet not implanted embryos or implanted fetuses, itshould be noted that the methods disclosed apply to the genetic data ofa range of organisms, in a range of contexts. The techniques describedfor cleaning genetic data are most relevant in the context ofpre-implantation diagnosis during in-vitro fertilization, prenataldiagnosis in conjunction with amniocentesis, chorion villus biopsy, andfetal blood sampling, and non-invasive prenatal diagnosis, where a smallquantity of fetal genetic material is isolated from maternal blood. Thediagnoses may focus on inheritable diseases, increased likelihoods ofdefects or abnormalities, as well as making phenotype predictions forindividuals to enhance clinical and lifestyle decisions. The inventionaddresses the shortcomings of prior art that are discussed above.

In one aspect of the invention, methods make use of imperfect knowledgeof the genetic data of the mother and the father, together with theknowledge of the mechanism of meiosis and the imperfect measurement ofthe embryonic DNA, in order to reconstruct, in silico, the embryonic DNAat the location of key SNPs with a high degree of confidence. It isimportant to note that the parental data allows the reconstruction notonly of SNPs that were measured poorly, but also of insertions,deletions, and of SNPs or whole regions of DNA that were not measured atall.

The disclosed method is applicable in the context of in-vitrofertilization, where a very small number of blastomeres are availablefor genotyping from each embryo being considered for implantation. Thedisclosed method is equally applicable to the context of Non-InvasivePrenatal Diagnosis (NIPD) where only a small number of fetal cells, orfragments of fetal DNA, have been isolated from the mother's blood. Thedisclosed method is more generally applicable in any case where alimited amount of genetic data is available from the target genome, andadditional genetic data is available from individuals who aregenetically related to the target.

In one aspect of the invention, the fetal or embryonic genomic datawhich has been reconstructed can be used to detect if the cell isaneuploid, that is, if fewer or more than two of a particular chromosomeis present in a cell. A common example of this condition is trisomy-21,which gives rise to Down syndrome. The reconstructed data can also beused to detect for uniparental disomy, a condition in which two of agiven chromosome are present, both of which originate from one parent.This is done by creating a set of hypotheses about the potential statesof the DNA, and testing to see which one has the highest probability ofbeing true given the measured data. Note that the use of high throughputgenotyping data for screening for aneuploidy enables a single blastomerefrom each embryo to be used both to measure multiple disease-linked locias well as screen for aneuploidy.

In another aspect of the invention, the direct measurements of theamount of genetic material, amplified or unamplified, present at aplurality of loci, can be used to detect for aneuploides, or uniparentaldisomy. The idea behind this method is simply that the amount of geneticmaterial present during amplification is proportional to the amount ofgenetic information in the initial sample, and measuring these levels atmultiple loci will give a statistically significant result.

In another aspect of the invention, the disclosed method can cleangenetic material of the individual which has been contaminated byforeign DNA or RNA by identifying the data generated by extraneousgenetic materials. The spurious signals generated by the contaminatingDNA can be recognized in a manner similar to that way thatchromosome-wide anomalous signals generated by aneuploides can bedetected.

In another aspect of the invention, target cells are isolated, thegenetic data contained in those cells are amplified, and measurements ofmultiple SNPs are made using a combination of one or more of thefollowing techniques: PCR-based amplification techniques, PCR-basedmeasurement techniques, or detection techniques based on MOLECULARINVERSION PROBES, or microarrays such as the GENECHIP or TAQMAN systems.This genetic data is then used in the system described herein.

In another aspect of the invention, the genetic data of an individualcan be cleaned using diploid and haploid data from both parents.Alternately, haploid data from a parent can be simulated if diploid andhaploid data of the parent's parent can be measured. In another aspect,genetic data from any person of a known genetic relation to theindividual can be used to clean the data of the individual, includingparents, siblings, grandparents, offspring, cousins, uncles, aunts etc.

In another aspect of the invention, the target and/or relatedindividual's genetic data may be partly or wholly known in silico,obviating the need for some direct measurements. Portions of the geneticdata can be generated in silico by means of an informatics approachutilizing a hidden Markov model.

In another aspect of the invention, it is possible to estimate theconfidence one has in the determination of those SNPs.

Note that the techniques described herein are relevant both tomeasurements of genetic material in one, or a small number of cells, aswell as to measurements on smaller amounts of DNA such as that which canbe isolated from the mother's blood in the context of Non-invasivePrenatal Diagnosis (NIPD). Also note that this method can be equallyapplied to genomic data in silico, i.e. not directly measured fromgenetic material.

It will be recognized by a person of ordinary skill in the art, giventhe benefit of this disclosure, aspects and embodiments that mayimplement one or more of the systems, methods, and features, disclosedherein.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1: an illustration of the concept of recombination in meiosis forgamete formation.

FIG. 2: an illustration of the variable rates of recombination along oneregion of Human Chromosome 1.

FIG. 3: determining probability of false negatives and false positivesfor different hypotheses.

FIG. 4: the results from a mixed female sample, all loci hetero.

FIG. 5: the results from a mixed male sample, all loci hetero.

FIG. 6: Ct measurements for male sample differenced from Ct measurementsfor female sample.

FIG. 7: the results from a mixed female sample; TAQMAN single dye.

FIG. 8: the results from a mixed male; TAQMAN single dye.

FIG. 9: the distribution of repeated measurements for mixed male sample.

FIG. 10: the results from a mixed female sample; qPCR measures.

FIG. 11: the results from a mixed male sample; qPCR measures.

FIG. 12: Ct measurements for male sample differenced from Ctmeasurements for female sample.

FIG. 13: detecting aneuploidy with a third dissimilar chromosome.

FIGS. 14A and 14B: an illustration of two amplification distributionswith constant allele dropout rate.

FIG. 15: a graph of the Gaussian probability density function of alpha.

FIG. 16: the general relationship diagram of the input data, thedatabase data, the algorithm and the output.

FIG. 17: a visual overview of how to derive P(H|M).

FIG. 18: a visual representation of the flow chart describing thealgorithm used to demonstrate the effectiveness of the cleaningalgorithm on simulated data.

FIG. 19: an illustration of a system that is configured to accomplishthe method disclosed herein, in the context of phenotype prediction ofembryos during IVF.

FIG. 20: a summary of different aneuploidy detection techniques

FIG. 21: an example of input data for the method described using SNPswith a low degree of cosegregation.

FIG. 22: an example of input data for the method described using SNPswith a high degree of cosegregation.

FIG. 23: an example of the output data for the input data shown in FIG.21.

FIG. 24: an example of the output data for the input data shown in FIG.23.

FIG. 25: the results of the preliminary simulation.

FIG. 26: the results of the full simulation of the method.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT Conceptual Overview ofthe System

The goal of the disclosed system is to provide highly accurate genomicdata for the purpose of genetic diagnoses. In cases where the geneticdata of an individual contains a significant amount of noise, or errors,the disclosed system makes use of the similarities between genetic dataof related individuals, and the information contained in that secondarygenetic data, to clean the noise in the target genome. This is done bydetermining which segments of chromosomes were involved in gameteformation and where crossovers occurred during meiosis, and thereforewhich segments of the secondary genomes are expected to be nearlyidentical to sections of the target genome. In certain situations thismethod can be used to clean noisy base pair measurements, but it alsocan be used to infer the identity of individual base pairs or wholeregions of DNA that were not measured. In addition, a confidence can becomputed for each reconstruction call made. A highly simplifiedexplanation is presented first, making unrealistic assumptions in orderto illustrate the concept of the invention. A detailed statisticalapproach that can be applied to the technology of today is presentedafterward.

Another goal of the system is to detect abnormal numbers of chromosomes,sections of chromosomes, and origins of chromosomes. In genetic samplesthat are aneuploid, have unbalanced translocations, uniparental disomy,or other gross chromosomal abnormalities, the amount of genetic materialpresent at a plurality of loci can be used to determine the chromosomalstate of the sample. There are multiple approaches to this method, andseveral of them are described here. In some approaches, the amount ofgenetic material present in a sample is sufficient to directly detectaneuploides. In other approaches, the method for cleaning the geneticmaterial can be used to enhance the efficiency of detection ofchromosomal imbalances. A confidence can be computed for eachchromosomal call made.

Technical Description of the System A Simplified Example

FIG. 1 illustrates the process of recombination that occurs duringmeiosis for the formation of gametes in a parent. The chromosome 101from the individual's mother is shown in grey. The chromosome 102 fromthe individual's father is shown in white. During this interval, knownas Diplotene, during Prophase I of Meiosis, a tetrad of four chromatids103 is visible. Crossing over between non-sister chromatids of ahomologous pair occurs at the points known as recombination nodules 104.For the purpose of illustration, the example will focus on a singlechromosome, and three Single Nucleotide Polymorphisms (SNPs), which areassumed to characterize the alleles of three genes. For this discussionit is assumed that the SNPs may be measured separately on the maternaland paternal chromosomes. This concept can be applied to many SNPs, manyalleles characterized by multiple SNPs, many chromosomes, and to thecurrent genotyping technology where the maternal and paternalchromosomes cannot be individually isolated before genotyping.

Attention must be paid to the points of potential crossing over inbetween the SNPs of interest. The set of alleles of the three maternalgenes may be described as (a_(m1), a_(m2), a_(m3)) corresponding to SNPs(SNP₁, SNP₂, SNP₃). The set of alleles of the three paternal genes maybe described as (a_(p1), a_(p2), a_(p3)). Consider the recombinationnodules formed in FIG. 1, and assume that there is just onerecombination for each pair of recombining chromatids. The set ofgametes that are formed in this process will have gene alleles: (a_(m1),a_(m2), a_(p3)), (a_(m1), a_(p2), a_(p3)), (a_(p1), a_(m2), a_(p3)),(a_(p1), a_(p2), a_(m3)). In the case with no crossing over ofchromatids, the gametes will have alleles (a_(m1), a_(m2), a_(m3)),(a_(p1), a_(p2), a_(p3)). In the case with two points of crossing overin the relevant regions, the gametes will have alleles (a_(m1), a_(p2),a_(m3)), (a_(p1), a_(m2), a_(p3)). These eight different combinations ofalleles will be referred to as the hypothesis set of alleles, for thatparticular parent.

The measurement of the alleles from the embryonic DNA will be noisy. Forthe purpose of this discussion take a single chromosome from theembryonic DNA, and assume that it came from the parent whose meiosis isillustrated in FIG. 1. The measurements of the alleles on thischromosome can be described in terms of a vector of indicator variables:A=[A₁ A₂ A₃]^(T) where A₁=1 if the measured allele in the embryonicchromosome is a_(m1), A₁=−1 if the measured allele in the embryonicchromosome is a_(p1), and A₁=0 if the measured allele is neither a_(m1)or a_(p1). Based on the hypothesis set of alleles for the assumedparent, a set of eight vectors may be created which correspond to allthe possible gametes describe above. For the alleles described above,these vectors would be a₁=[1 1 1]^(T), a₂=[1 1 −1]^(T), a₃=[1 −1 1]^(T)a₄=[1 −1 −1]^(T), a₅=[−1 1 1]T, a₆=[−1 1−1]^(T), a₇=[−1 −1 1]^(T),a₈=[−1 −1 −1]^(T). In this highly simplified application of the system,the likely alleles of the embryo can be determined by performing asimple correlation analysis between the hypothesis set and the measuredvectors:

i*=arg max_(i) A ^(T) a _(i) ,i=1 . . . 8  (1)

Once i* is found, the hypothesis a_(i*) is selected as the most likelyset of alleles in the embryonic DNA. This process is then repeatedtwice, with two different assumptions, namely that the embryonicchromosome came from the mother or the father. That assumption whichyields the largest correlation A^(T)a_(i*) would be assumed to becorrect. In each case a hypothesis set of alleles is used, based on themeasurements of the respective DNA of the mother or the father. Notethat in a typical embodiment of the disclosed method, one measures alarge number of SNPs between those SNPs that are important due to theirassociation with particular disease phenotypes—these will be referred tothese as Phenotype-associated SNPs or PSNPs. TheNon-phenotype-associated SNPs (NSNPs) between the PSNPs may be chosena-priori (for example, for developing a specialized genotyping array) byselecting from the NCBI dbSNP database those RefSNPs that tend to differsubstantially between individuals. Alternatively, the NSNPs between thePSNPs may be chosen for a particular pair of parents because they differbetween the parents. The use of the additional SNPs between the PSNPsenables one to determine with a higher level of confidence whethercrossover occurs between the PSNPs. It is important to note that whiledifferent “alleles” are referred to in this notation, this is merely aconvenience; the SNPs may not be associated with genes that encodeproteins.

The System in the Context of Current Technology

In another more complex embodiment, the a-posteriori probability of aset of alleles is computed given a particular measurement, taking intoaccount the probability of particular crossovers. In addition, thescenario typical of microarrays and other genotyping technologies isaddressed where SNPs are measured for pairs of chromosomes, rather thanfor a single chromosome at a time. The measurements of the genotype atthe locus i for the embryonic, paternal and maternal chromosomes may becharacterized respectively by random variables representing the pairs ofSNP measurements (e_(1,i), e_(2,i)), (p_(1,i), p_(2,i)) and (m_(1,i),m_(2,i)). Since one cannot determine the presence of crossovers in thematernal and paternal chromosomes if all measurements are made as pairs,the method is modified: in addition to genotyping the fertilized embryosand paternal and maternal diploid tissue, one haploid cell from eachparent, namely, a sperm cell and an egg cell, is also genotyped. Themeasured alleles of the sperm cell are represented by p_(1,i), i=1 . . .N and the complementary alleles measured from the paternal diploidtissue by p_(2,i). Similarly, the measured alleles of the egg cell arerepresented by m_(1,i) and their complement in the mother's diploid cellby m_(2,i). These measurements provide no information on where theparental chromosomes crossed over in generating the measured sperm andegg cells. However, one can assume that the sequence of N alleles on theegg or sperm was created from the parental chromosomes by a small numberof, or no, crossovers. This is sufficient information to apply thedisclosed algorithm. A certain error probability is associated withcalling the paternal and maternal SNPs. The estimation of this errorprobability will vary based on the measurements made (p_(1,i), p_(2,i))and (m_(1,i), m_(2,i)) and the signal-to-noise ratio for the technologyused. Although these error probabilities can be uniquely computed foreach locus without affecting the disclosed method, the algebra issimplified here by assuming that the probabilities of correctly callingthe paternal and maternal SNPs are constant at p_(p) and p_(m)respectively.

Assume that a measurement is performed on the embryonic DNA which istermed measurement M. In addition, the notation is slightly modified sothat A is now a set and not a vector: A refers to a particularhypothesis about the combination (or set) of alleles derived from eachparent. The set of all possible combinations of alleles A from bothparents is denoted as S_(A). The goal is to determine the combination ofalleles (or that hypothesis) A ∈S_(A) with the highest a-posterioriprobability, given the measurement M:

A*=arg max_(A) P(A|M),∀A∈S _(A)  (2)

Using the law of conditional probabilities, P(A|M)=P(M|A)P(A)/P(M).Since P(M) is common for all different A's, the optimizing search can berecast as:

A*=arg max_(A) P(M/A)P(A),∀A∈S _(A)  (3)

Now consider the computation of P(M/A). Begin with a single locus i, andlet the hypothesis be that this locus on the embryo is derived from theparental SNPs p_(t,1,i) and m_(t,1,i), where the underscore _(t) is usedto denote the true value of these Parental SNPs, as opposed to themeasurements performed, p_(1,i) and m_(1,i), which may or may not becorrect. The true value of the embryonic SNPs is denoted as (e_(t,1,i),e_(t,2,i)). If hypothesis A is true, then (e_(t,1,i),e_(t,2,i))=(p_(t,1,i), m_(t,1,i)) or (m_(t,1,i), p_(t,1,i)). Since onecannot differentiate which of the measurements (e_(1,i), e_(2,i)) comesfrom which parent, both orders must be considered so the hypothesis setA=[(p_(t,1,i), m_(t,1,i)), (m_(t,1,i), p_(t,1,i))]. The probability of aparticular measurement M depends on the true values or the underlyingstates of the parental SNPs, namely (p_(t,1,i), p_(t,2,i)) and(m_(t,1,i), m_(t,2,i)). Since there are four SNPs, p_(t,1,i), p_(t,2,i),m_(t,1,i), m_(t,2,i), and each of these can assume the value of fournucleotide bases, A,C,T,G, there are 4⁴ or 256 possible states. Thealgorithm is illustrated for one state s₁ for which it is assumed thatp_(t,1,i)≠p_(t,2,i)≠m_(t,1,i)≠m_(t,2,i). From this explanation, it willbe clear how to apply the method to all 256 possible states, s_(k), k=1. . . 256. Assume a measurement M of embryonic SNPs (e_(1,i), e_(2,i))is performed, and the result e_(1,i)=p_(1,i), e_(2,i)=m_(1,I) isobtained. The a priori probability for this measurement given thathypothesis A and state s₁ are true is computed:

P(e _(1,i) =p _(1,i) ,e _(2,i) =m _(1,i) |A,s ₁)=P(e _(t,1,i) =p_(t,1,i) ,e _(t,2,i) =m _(t,1,i) |A,s ₁)P(e _(1,i) =p _(1,i) |e _(t,1,i)=p _(t,1,i))P(e _(t,2,i) =m _(t,1,i))+P(e _(t,1,i) =m _(t,1,i) ,e_(t,2,i) =p _(t,1,i) ,|A,s ₁)P(e _(1,i) =p _(1,i) |e _(t,1,i) =m_(t,1,i) ,p _(t,1,i) ≠m _(t,1,i))P(e _(2,i) =m _(1,i) |e _(t,2,i) ,p_(t,2,i) ≠m _(t,1,i))   (4)

Consider the first expressions in the first term and second term:P(e_(1,i)=p_(1,i),e_(2,i)=m_(1,i)|A,s₁)=P(e_(1,i)=m_(1,i),e_(2,i)=p_(1,i)|A,s₁)=0.5since the hypothesis A=[(p_(t,1,i), m_(t,1,i)), (m_(t,1,i), p_(t,1,i))]makes two orderings for the embryonic SNPs equally likely. Now considerthe second expression of the first term,P(e_(1,i)=p_(1,i)|e_(t,1,i)=p_(t,1,i)), the probability of measuringe_(1,i)=p_(1,i) given the assumption that embryonic SNP e_(t,1,i)actually is derived from paternal SNP p_(t,1,i). The probabilities forcorrectly measuring the paternal SNPs, maternal SNPs, and embryonic SNPsare p_(p), p_(m), and p_(e). Given the assumption (e_(t,1,i)=p_(t,1,i)),the measurement (e_(1,i)=p_(1,i)) requires either that both embryonicand paternal SNPs are correctly measured, or that both are incorrectlymeasured and they happen to be incorrectly measured as the samenucleotide (A,C,T, or G). So,P(e_(1,i)=p_(1,i)|e_(t,1,i)=p_(t,1,i))=p_(e)p_(p)+(1−p_(e))(1−p_(p))/3where it is assumed for simplicity that the probability of incorrectlycalling all of the four nucleotides is equally likely—the algorithm canbe easily modified to accommodate different probabilities of calling aparticular nucleotide (A,C,T,G) given a measurement on anotherparticular nucleotide. The same approach may be applied to the thirdexpression in the first term to obtainP(e_(2,i)=m_(1,i)|e_(t,2,i)=m_(t,1,i))=p_(e)p_(m)+(1-p_(e))(1-p_(m))/3.Now consider the second expression of the second term.P(e_(1,i)=p_(1,i)|e_(t,1,i)=m_(t,l,i), m_(t,1,i) p_(t,1,i)) requireseither that e_(1,i) or p_(1,i) be an incorrect measurement, or that bothbe incorrect measurements, so that the measured values happen to beequal: P(e_(1,i)=p_(1,i)|e_(t,1,i)=m_(t,1,I),m_(t,1,i)≠p_(t,1,i))=p_(e)(1-p_(p))/3+(1-p_(e))p_(p)/3+(1-p_(e))(1-p_(p))2/9.The same argument can be applied to the last expression of the secondterm to yield P(e_(2,i)=m_(1,i)|e_(t,2,i)=p_(t,2,i), m_(t,1,i)p_(t,2,i))=p_(e)(1-p_(m))/3+(1-p_(e))p_(m)/3+(1−p_(e))(1-p_(m))2/9. Now,combining all of these terms, and making the assumption—merely tosimplify the algebra—that p_(e)=p_(p)=p_(m)=p, one can compute:

P((e _(1,i) =p _(1,i) ,e _(2,i) =m _(1,i))|A,s ₁)=1/162(160p ⁴−160p³+96p ²−28p+13)  (5)

Although the computation will vary, a similar conceptual approach tothat described here would be used for all 256 possible states, s_(k),k=1 . . . 256. Computing P(e_(1,i)=p_(1,I), e_(2,i)=m_(1,I)|A,s_(i)) forall 256 states s_(i) and summing over the probability of each s_(i) oneobtains P(e_(1,i)=p_(1,I), e_(2,i)=m_(1,i)|A). In other words:

$\begin{matrix}{{P\left( {M\text{|}A} \right)} = {\sum\limits_{i = {1{\ldots 256}}}{{P\left( {{M\text{|}A},s_{i}} \right)}{P\left( s_{i} \right)}}}} & (6)\end{matrix}$

In order to compute the probabilities of each state s_(i), P(s_(i)), onemust treat all the separate alleles making up a state as separate eventssince they are on separate chromosomes, in other words:P(s_(i))=P(p_(t,1,i), p_(t,2,i), m_(t,1,i),m_(t,2,i))=P(p_(t,1,i))P(p_(t,2,i))P(m_(t,1,i))P(m_(t,2,i)). Bayesiantechniques may be applied to estimate the probability distribution forthe individual measurements. Every measurement of an allele on thematernal or paternal chromosomes at locus i may be treated as a cointoss experiment to measure the probability of this allele being aparticular value (A,C,T or G). These measurements are made on the adulttissue samples and may be treated as being totally reliable, even thoughpairs of alleles are measured for each SNP, and it is not possible todetermine which allele comes from which chromosome. Letw_(p,1,i)=P(p_(t,1,i)), corresponding to the probability of the SNP i onthe father's chromosome being value p_(t,1,i). In the followingexplanation, w is used instead of w_(p,1,i). Let the measurementsperformed on SNP i of the father's chromosome be characterized ascollecting data D. One can create a probability distribution for w, p(w)and update this after the data is measurement according to BayesTheorem: p(w|D)=p(w)p(D|w)/p(D). Assume n alleles of SNP i are observedand that the particular allele corresponding to w comes up h times—inother words, heads is observed h times. The probability of thisobservation can be characterized by the binomial distribution

$\begin{matrix}{{p\left( {D\text{|}w} \right)} = {\begin{pmatrix}n \\h\end{pmatrix}{w^{h}\left( {1 - w} \right)}^{n - h}}} & (7)\end{matrix}$

Before data is collected, assume there is a prior distribution p(w)which is uniform between 0 and 1. By applying the Bayes theorem, it isstraightforward to show that the resulting distribution for p(w|D) willbe a beta distribution of the form:

$\begin{matrix}{{{p\left( {w\text{|}D} \right)} = {\frac{1}{c}{w^{h}\left( {1 - w} \right)}^{n - h}}}{{{where}\mspace{14mu} c} = {\int_{0}^{1}{{w^{h}\left( {1 - w} \right)}^{n - h}{dw}}}}} & (8)\end{matrix}$

and c is a normalizing constant. However many times p(w|D) is thenupdated by applying Bayes theorem and new measurements, it will continueto have a beta distribution as above. The estimates of p(w) are updatedevery time a new measurement is collected. Note that there will be adifferent function p(w) for different races and different genders, usingthe same groupings used in the Hapmap project, since the probability ofdifferent alleles at particular SNPs is dependent on these groupings ofrace and gender. For the computation of P(s_(i)), each allele on eachchromosome will be associated with an estimated probabilitydistribution, namely p_(p,1,i)(w_(p,1,i)), p_(p,2,i)(w_(p,2,i)),p_(m,1,i)(w_(m,1,i)) and p_(m,2,i)(w_(m,2,i)). One may then compute themaximum a-posteriori (MAP) estimate for P(s_(i)) according to the MAPestimate for each of the individual distributions. For example, letw_(p,1,i)* be the argument that maximizes p_(p,1,i)(w_(p,1,i)). The MAPestimate of P(s_(i)) may be found according to

P(s _(i))_(MAP) =w _(p,1,i) *w _(p,2,i) *w _(m,1,i) *w _(m,2,i)*  (9)

Since there is a probability distribution for each w, one can alsocompute conservative estimates of the values P(s_(i)) to any specifiedconfidence level, by integrating over the probability distribution,rather than simply using the MAP estimates. It is possible to do this,for example, to conservatively estimate P(M|A) to within some confidencelevel. Whether a conservative estimate or a MAP estimate is used, theestimate of P(s_(i)) is continually refined for the computation ofP(M|A). In what follows, reference to the assumed state will beeliminated to simplify the notation, and state s_(i) is assumed for allexplanations of detailed computation. Bear in mind that in actualitythese calculations would be performed for each of 256 states and besummed over the probability of each.

The method of computing P(M|A) is now extended to multiple SNP loci,assuming that M represents the set of measurements of N pairs of SNPs onthe embryo, M=[M₁, . . . ,M_(N)]. Assume also that A represents the setof hypotheses for each SNP about which parental chromosomes contributedto that SNP, A=[A₁, . . . ,A_(N)]. Let S_(A′) represent the set of allother possible hypotheses that are different from A or are in the setA′. P(M|A) and P(M|A′) may be computed:

$\begin{matrix}{{{P\left( {M\text{|}A} \right)} = {\prod\limits_{i = {1\ldots \; N}}{P\left( {M_{i}\text{|}A_{i}} \right)}}},{{P\left( {M\text{|}A^{\prime}} \right)} = {\sum\limits_{A \in S_{A^{\prime}}}{{P(A)}{\prod\limits_{i = {1\ldots \; N}}{P\left( {M_{i}\text{|}A_{i}} \right)}}}}}} & (10)\end{matrix}$

Consider the computation of P(A). In essence, this is based on thelikelihood of particular crossovers occurring in the formation of thegametes that form the embryo. The probability of a particular allele setdepends on two factors, namely the probability that the embryonicchromosome comes from the mother or the father, and the probability of aparticular combination of crossovers. For a normal set of embryonicchromosomes that do not suffer from aneuploidy, the a-priori probabilitythat the embryonic chromosome comes from the mother or father is ˜50%and is consequently common for all A. Now, consider the probability of aparticular set of recombination nodes. The number of relevantrecombination sites R depends on the number of measured SNPS: R=N−1.Since the DNA segment constituting N NSNPs around the PSNP of interestwill be relatively short, crossover interference makes it highlyimprobable that two crossovers on the same chromosome can occur in oneregion. For reasons of computational efficiency this method assumes thatonly one crossover will occur in each region for each relevantchromosome, and this can occur at R possible sites. It will be obviousto someone skilled in the art how this method may be extended to includethe possibility where there are multiple crossovers in a given region.

Let the probability of a crossover in each region between SNPs bedenoted P_(r), r=1 . . . N−1. To first order, the probability of arecombination node in a region r between two SNPs is proportional to thegenetic distance between those SNPs (measured in cMorgans). However,much recent research has enabled a precise modeling of the probabilityof recombination between two SNP loci. Observations from sperm studiesand patterns of genetic variation show that recombination rates varyextensively over kilobase scales and that much recombination occurs inrecombination hotspots, and causes linkage disequilibrium to display ablock-like structure. The NCBI data about recombination rates on theHuman Genome is publicly available through the UCSC Genome AnnotationDatabase.

Various data sets can be used singly or in combination. Two of the mostcommon data sets are from the Hapmap Project and from the Perlegen HumanHaplotype Project. The latter is higher density; the former is higherquality. See FIG. 2 for the regional recombination rates from positions1,038,423 to 4,467,775 of chromosome 1, based on the HapMap Phase Idata, release 16a. These rates were estimated using the reversible-jumpMarkov Chain Monte Carlo (MCMC) method which is available in the packageLDHat. The state-space considered is the distribution of piece-wiseconstant recombination rate maps. The Markov chain explores thedistribution of the number and location of rate change-points, inaddition to the rates for each segment, 201. These results may be usedto generate an estimate of P_(r) by integrating over the recombinationrates times by the length of each constant segment between the SNPS. Thecumulative recombination rate over the nucleotides 202 is shown in FIG.2 in red.

Let C be a set of indicator variables c_(r) such that c_(r)=1 if acrossover occurred in region r and 0 otherwise. c₀=1 if no crossoversoccurred and 0 otherwise. Since it is assumed that only one crossovercan occur in a region of N SNPs, only one element of the set C isnon-zero. Hence, the probability of crossover represented by set C isfound to be:

$\begin{matrix}{P_{c} = {\left( {1 - {\sum\limits_{r = {{1\ldots \; N} - 1}}P_{r}}} \right)^{c_{0}}{\prod\limits_{r = 1}P_{r}^{c_{r}}}}} & (11)\end{matrix}$

In the hypothesis A about SNPs 1 . . . N, there are four potentialcrossovers of relevance. Namely, the potential crossovers in i) thepaternal chromosomes that formed the embryo (denoted by set C_(pe) ofindicator variables), ii) the paternal chromosomes that formed thesequenced sperm (set C_(ps)), iii) the maternal chromosomes that formedthe embryo (set C_(me)) and iv) the maternal chromosomes that formed thesequenced egg (set C_(ee)). Two additional assumptions are v) whetherthe first paternal embryonic SNP comes from p_(t,1,1) or p_(t,2,1) andvi) whether the first maternal embryonic SNP comes from m_(t,1,1) orm_(t,2,1). Since the probabilities of crossovers between SNPs is foundto differ between races and sexes, different crossover probabilitieswill be denoted as P_(p,r) for the paternal chromosomes, and P_(m,r) forthe maternal chromosomes. Therefore, the probability of a particularhypothesis A, which subsumes the sets C_(pe), C_(ps), C_(me), C_(ee) isexpressed as:

$\begin{matrix}{{P(A)} = {\frac{1}{4}\left( {1 - {\sum\limits_{r = {{1\ldots \; N} - 1}}P_{p,r}}} \right)^{c_{{pe},0}}{\prod\limits_{r = {{1\ldots \; N} - 1}}{{P_{p,r}^{c_{{ps},r}}\left( {1 - {\sum\limits_{r = {{1\ldots \; N} - 1}}P_{p,r}}} \right)}^{c_{{ps},0}}{\prod\limits_{r = {{1\ldots \; N} - 1}}{{P_{p,r}^{c_{{ps},r}}\left( {1 - {\sum\limits_{r = {{1\ldots \; N} - 1}}P_{m,r}}} \right)}^{c_{{me},0}}{\prod\limits_{r = {{1\ldots \; N} - 1}}{{P_{m,r}^{c_{{me},r}}\left( {1 - {\sum\limits_{r = {{1\ldots \; N} - 1}}P_{m,r}}} \right)}^{c_{{ee},0}}{\prod\limits_{r = {{1\ldots \; N} - 1}}P_{m,r}^{c_{{ee},r}}}}}}}}}}} & (12)\end{matrix}$

Now with the equations for determining P(A) and P(M/A), all the elementsnecessary to compute A* per Equation 3 above have been defined. Hence,it is possible to determine from the highly error-prone measurements ofthe embryonic SNPs where crossovers occurred, and to consequently cleanthe embryonic measurements with a high degree of confidence. It remainsto determine the degree of confidence in the best hypothesis A*. Todetermine this, it is necessary to find the odds ratio P(A*|M)/P(A*′|M).The tools have all been described above for this computation:

$\begin{matrix}{\frac{P\left( {A^{*}\text{|}M} \right)}{P\left( {A^{*\prime}\text{|}M} \right)} = {\frac{P\left( {A^{*}\text{|}M} \right)}{1 - {P\left( {A^{*}\text{|}M} \right)}} = {\frac{{P\left( A^{*} \right)}{P\left( {M\text{|}A^{*}} \right)}}{{P\left( A^{*\prime} \right)}{P\left( {M\text{|}A^{*\prime}} \right)}} = {\frac{{P\left( A^{*} \right)}{P\left( {M\text{|}A^{*}} \right)}}{\left( {1 - {P\left( A^{*} \right)}} \right){P\left( {M\text{|}A^{*\prime}} \right)}} = {OR}_{A^{*}}}}}} & (13)\end{matrix}$

The confidence in A* is then given as P(A*|M)=OR_(A)*/(1+OR_(A*)). Thiscomputation indicates the confidence in a particular hypothesis A*, butit does not indicate a confidence in a particular determination of aSNP. In order to compute the confidence in a determination of embryonicPSNP n, it is necessary to create the set of all hypotheses A that don'tchange the value of this SNP. This set will be denoted as S_(A*,n),which corresponds to all hypothesis that result in PSNP n on the embryohaving the same value as is predicted by hypothesis A*. Similarly,create a set S_(A*′,n) which corresponds to all hypothesis that resultin PSNP n having a different value to that predicted by hypothesis A*.Now, it is possible to compute the odds ratio of the probability thatthe SNP is correctly called versus the probability that the SNP isincorrectly called:

$\begin{matrix}{{OR}_{A^{*},n} = {\frac{\sum\limits_{A \in S_{A^{*},n}}{P\left( {A\text{|}M} \right)}}{\sum\limits_{A \in S_{A^{*\prime},n}}{P\left( {A\text{|}M} \right)}} = {\frac{\sum\limits_{A \in S_{A^{*},n}}{P\left( {A\text{|}M} \right)}}{1 - {\sum\limits_{A \in S_{A^{*},n}}{P\left( {A\text{|}M} \right)}}} = \frac{\sum\limits_{A \in S_{A^{*},n}}{{P(A)}{P\left( {M\text{|}A} \right)}}}{\sum\limits_{A \in S_{A^{*\prime},n}}{{P(A)}{P\left( {M\text{|}A} \right)}}}}}} & (14)\end{matrix}$

The confidence in the particular call of embryonic SNP n based on theodds ratio OR_(A*,n) can be computed as:

$\begin{matrix}{{P({correctlycalledSNPn})} = {{\sum\limits_{A \in S_{A^{*},n}}{P\left( {A\text{|}M} \right)}} = \frac{{OR}_{A^{*},n}}{1 + {OR}_{A^{*},n}}}} & (15)\end{matrix}$

Note that this technique could also be used to detect such defects asuniparental disomy (UPD) wherein two of the same chromosomes are fromthe same parent, while none of that chromosomes from the other parent ispresent. Upon attempting to deduce the crossovers in the parentchromosomes, there will be no hypothesis which adequately explains thedata with a high confidence, and if alternate hypotheses are allowedthat include the possibility of UPD, they will found to be more likely.

Bounding the Effect of Uncertainty in Recombination Rates and SNPMeasurement Reliability

The disclosed method depends on: assumptions about the probability ofrecombination between particular SNPs; assumptions about the probabilityof the correct measurement of each SNP on the embryonic, sperm, egg,paternal and maternal chromosomes; and assumptions about the likelihoodof certain alleles within different population groups. Consider each ofthese assumptions: the mechanism of recombination is not perfectlyunderstood and modeled, and the crossover probability has beenestablished to vary based on an individual's genotype. Furthermore, thetechniques by which the recombination rates are measured showsubstantial variability. For example, the package LDHat, whichimplements the reversible-jump Markov Chain Monte Carlo (MCMC) method,makes a set of assumptions and requires a set of user inputs about themechanism and characterization of recombination. These assumptions canaffect predicted recombination rates between SNPs as is evinced by thedifferent results obtained by various studies.

It is anticipated that the assumptions about recombination rates, out ofall assumptions listed above, will have the most impact on Equation 15.The computations described above should be based on the best estimatesof the probability for crossover between SNPS, P_(r). Thereafter,conservative estimates may be used for P_(r) using values at, forexample, the 95% confidence bounds for the recombination rates, in thedirection that reduces the confidence measure P(correctly called SNP n).The 95% confidence bounds may be derived from confidence data producedby various studies of recombination rates, and this may be corroboratedby looking at the level of discordance between published data fromdifferent groups using different methods.

Similarly, the 95% confidence bounds may be used for the estimates ofthe probability that each SNP is correctly called: p_(p), p_(m), p_(e).These numbers can be computed based on the actual measured arrayintensities included in the genotyping assay output files, combined withempirical data on the reliability of the measurement technique. Notethat those NSNPs for which these parameters p_(p), p_(m) and p_(e) arenot well established may be ignored. For example, since the diploidparental data is reliably measured, one may ignore NSNP measurements ofthe parents' haploid cells and on the embryo that do not correspond toany of the alleles on the relevant SNPs of the parent's diploid tissue.

Lastly, consider the assumptions about the likelihood of certain alleleswithin different population groups, which give rise to the computationP(s_(i)). These assumptions also will not have a large impact on thedisclosed method since the measurement of the parental diploid data isreliable i.e. direct measurement of the state s_(i) from the parentalsamples typically result in data with high confidence. Nonetheless, itis possible to use the probability distribution for each as described inEquation 8 in order to compute a confidence bound for the probability ofeach state P(s_(i)). As above, one may compute the 95% confidence boundfor each P(s_(i)) in the conservative direction that reduces confidencemeasure P(correctly called SNP n).

The determination of P(correctly called SNP n) will inform the decisionabout how many NSNPs need to be measured around each PSNP in order toachieve the desired level of confidence.

Note that there are different approaches to implementing the concept ofthe disclosed method, namely combining the measurement of the parent'sDNA, the measurement of the DNA of one or more embryos, and the a-prioriknowledge of the process of meiosis, in order to obtain a betterestimate of the embryonic SNPs. It will be clear to one skilled in theart, how similar methods can be applied when different subsets of thea-priori knowledge are known or not known, or known to a greater orlesser degree of certainty. For example, one can use the measurements ofmultiple embryos to improve the certainty with which one can call theSNPs of a particular embryo or to accommodate missing data from theparents. Note also that one does not need a PSNP of interest to bemeasured by the measurement technique. Even if that PSNPs is notdetermined by the measurement system, it can still be reconstructed witha high degree of confidence by the disclosed method.

Also note that once the points of crossover that occurred during meiosishave been determined, and the regions of the target genome have beenmapped to the pertinent regions of the parental DNA, it is possible toinfer not only the identity of individual SNPs of interest, but alsowhole regions of DNA that may be missing in the measured target genomedue to allele drop-out or other errors in measurement. It is alsopossible to measure insertions and deletions in the parental DNA, anduse the disclosed method to infer that they exist in the target DNA.

Various techniques may be used to improve the computational complexityof the disclosed algorithm described above. For example, one may only orpredominantly select those NSNPs that differ between the mother and thefather. Another consideration would be to only use NSNPs that are spacednearby the PSNPs to minimize the chance of crossovers occurring betweenthe NSNPs and PSNPs of interest. One could also use NSNPs that werespaced along the chromosome so as to maximize coverage of multiplePSNPs. Another consideration will be to initially use only a smallnumber of NSNPs to determine roughly where crossovers occurred, and withonly a limited degree of certainty. Additional NSNPs can then be used torefine the crossover model and increase the probability of correctlycalling the PSNPs. The number of crossover combinations to considerscales roughly as N^(C) where N is the number of SNPs and C is themaximum number of crossovers. Consequently, for C=4 it is possible toaccommodate roughly N=100 for each PSNP while remaining computationallytractable on a Pentium-IV processor. Using the approaches describedabove and other approaches for increased computational efficiency,N>100, C>4 can be easily accommodated. One such approach is describedbelow.

Note that there are many other approaches to make a call on a PSNP andgenerate an estimate of the probability that a PSNPs has been correctlydetermined, based on a particular set of embryonic data, parent data,and algorithm used, without changing the underlying concept. Thisprobability can be used for individual decision-making, and forimplementing a reliable service in the context of IVF or NIPGD.

Recursive Solution to the Genetic Data Cleaning Algorithm

Another embodiment of the invention involving an algorithm that scaleslinearly is described here. Given the limited nature of computationpower, the length of the computation may be a significant factor in theuse of the disclosed method. When running computations, any algorithmthat must compute certain values where the number of computations neededrises exponentially with the number of SNPs can become unwieldy. Asolution that involves a number of calculations that increase linearlywith the number of SNPs will always be preferred from a time standpointas the number of SNPs gets large. Below this approach is described.

A simple approach, which is to consider all possible hypotheses mustcontend with the running time being an exponential function in number ofSNPs. Suppose, as before, that measured data are a collection ofmeasured embryo, father and mother chromosome measurements on k SNPs,i.e. M={M₁, . . . ,M_(k)} whereM_(i)=(e_(1i),e_(2i),p_(1i),p_(2i),m_(1i),m_(2i)). As before, thehypotheses space is S_(H)={H¹, . . . ,H^(q)}={set of all thehypotheses}, where each hypothesis is of the format H^(j)=(H^(j) ₁, . .. H^(j) _(k)) where H^(j) _(i) is the “mini” hypothesis for snip i, ofthe format H^(j) _(i)=(p_(i)*,m_(i)*) where p_(i)*∈{p_(1i), p_(2i)} andm_(i)*∈{m_(1i), m_(2i)}. There are 4 different “mini” hypotheses H^(j)_(i), in particular:

H ^(j) _(i)1:(e _(1i) ,e _(2i))={(p _(1i) ,m _(1i)) or (m _(1i) ,p_(1i))};H ^(j) _(i)2:(e _(1i) ,e _(2i))={(p _(1i) ,m _(2i)) or (m _(2i),p _(1i))}

H ^(j) _(i)3:(e _(1i) ,e _(2i))={(p _(2i) ,m _(1i)) or (m _(1i) ,p_(2i))};H ^(j) _(i)4:(e _(1i) ,e _(2i))={(p _(2i) ,m _(2i)) or (m _(2i),p _(2i))}

The goal is to choose the most likely hypothesis H* as:

H*=arg max_(H∈S) _(H) P(H|M)=arg max_(H∈S) _(H) F(M,H) where functionF(M,H)=P(H|M)

There are 4^(k) different hypotheses in the space S^(H). By trying tofind the best hypothesis by exhaustively exploring the entire spaceS^(H), the necessary algorithm would be of exponential order in kO(exp(k)), where k is the number of SNPs involved. For large k, evenk>5, this is immensely slow and unpractical. Therefore, it is morepractical to resort to a recursive solution which solves the problem ofsize k as a function of the problem of size (k−1) in constant time. Thesolution shown here is of the linear order in k, O(k).

Recursive Solution Linear in the Number of SNPs

Begin with F(M,H)=P(H|M)=P(M|H)*P(H)/P(M). Then argmax_(H)F(M,H)=argmax_(H) P(M|H)*P(H) and the goal is to solve P(M|H)*P(H) inlinear time. Suppose that M_((s,k))=measurement on SNPs s to k,H(s,k)=hypothesis on SNPs s to k, and to simplify notation M(k,k)=M_(k),H_((k,k))=H_(k)=measurement and hypothesis on SNP k. As shown before:

${P\left( {M_{({1,k})}\text{|}H_{({1,k})}} \right)} = {{\prod\limits_{i = 1}^{k}\; {P\left( {M_{i}\text{|}H_{i}} \right)}} = {{{P\left( {M_{k}\text{|}H_{k}} \right)}^{*}{\prod\limits_{i = 1}^{k - 1}\; {P\left( {M_{i}\text{|}H_{i}} \right)}}} = {{P\left( {M_{k}\text{|}H_{k}} \right)}^{*}{P\left( {M_{({1,{k - 1}})}\text{|}H_{({1,{k - 1}})}} \right)}}}}$

Also:

F(M, H) = P(M|H)^(*)P(H) = P(M_((1, k))|H_((1, k)))^(*)P(H_((1, k))) = P(M_((1, k − 1))|H_((1, k − 1)))^(*)P(H_((1, k − 1)))^(*)P(M_(k)|H_(k))^(*)PF(H_(k − 1), H_(k))

and PC(H_(i−1),H_(i))=probability of crossover between H_(i−1), H_(i)

Finally, for k SNPs:

${P\left( H_{({1,k})} \right)} = {{1\text{/}4^{*}{\prod\limits_{i = 2}^{k}\; {{PF}\left( {H_{i - 1},H_{i}} \right)}}} = {{{{PF}\left( {H_{k - 1},H_{k}} \right)}^{*}1\text{/}4^{*}{\prod\limits_{i = 2}^{k - 1}\; {{PF}\left( {H_{i - 1},H_{i}} \right)}}} = {{{PF}\left( {H_{k - 1},H_{k}} \right)}^{*}{P\left( H_{({1,{k - 1}})} \right)}}}}$     where${{PF}\left( {H_{i - 1},H_{i}} \right)} = \left\{ \begin{matrix}{1 - {{PC}\left( {H_{i - 1},H_{i}} \right)}} & {H_{i - 1} = H_{i}} \\{{PC}\left( {H_{i - 1},H_{i}} \right)} & {H_{i - 1} \neq H_{i}}\end{matrix} \right.$

so in short F(M, H)=F(M_((1,k)), H_((1,k))))=F(M_((1,k-1)),H_((1,k-1))*P(M_(k)|H_(k))*PF(H_(k-1), H_(k)) i.e. we can reduce thecalculation of F on k SNPs to the calculation of F on k−1 SNPs.

For H=(H₁, . . . H_(k)),the hypothesis on k SNPs:

${\max\limits_{H}{F\left( {M,H} \right)}} = {\max\limits_{({H_{({1,{k - 1}})},H_{k}})}{F\left( {M,{\left( {H_{({1,{k - 1}})},H_{k}} \right) = {\max\limits_{H_{k}}{\max\limits_{H_{({1,{k - 1}})}}{F\left( {M,{\left( {H_{({1,{k - 1}})},H_{k}} \right) = {{\max\limits_{H_{k}}{{G\left( {M_{({1,k})},H_{k}} \right)}\mspace{79mu} {where}{G\left( {M_{({1,n})},H_{n}} \right)}}} = {\max\limits_{H_{({1,{n - 1}})}}{F\left( {M_{({1,n})}, {\left( {H_{({1,{n - 1}})},H_{n}} \right) = {{\max\limits_{H_{({1,{n - 1}})}}{{F\left( {M_{({1,{n - 1}})},H_{({1,{n - 1}})}} \right)}^{*} {P\left( {M_{n}\text{|}H_{n}} \right)}^{*} {{PF}\left( {H_{n - 1}, H_{n}} \right)}}} = {{{P\left( {M_{n}\text{|}H_{n}} \right)}^{*} {\max\limits_{H_{({1,{n - 1}})}}{{F\left( {M_{({1,{n - 1}})},H_{({1,{n - 1}})}} \right)}^{*} {{PF}\left( {H_{n - 1}, H_{n}} \right)}}}} = {{P\left( {M_{n}\text{|}H_{n}} \right)}^{*} {\max\limits_{H_{n - 1}}{\max\limits_{H_{({1,{n - 2}})}}{F\left( {M_{({1,{n - 1}})}, {{\left( {H_{({1,{n - 2}})},H_{n - 1}} \right)^{*} {{PF}\left( {H_{n - 1}, H_{n}} \right)}} = {{P\left( {M_{n}\text{|}H_{n}} \right)}^{*} {\max\limits_{H_{n - 1}}{{{PF}\left( {H_{n - 1},H_{n}} \right)}^{*} {G\left( {M_{({1,{n - 1}})}, H_{n - 1}} \right)}}}}}} \right.}}}}}}}} \right.}}}}} \right.}}}}} \right.}}$

To summarize this:

${\max\limits_{H}{F\left( {M,H} \right)}} = {\max\limits_{H_{n}}{G\left( {M_{({1,k})},H_{k}} \right)}}$

where G can be found recursively: for n=2, . . . ,k

${G\left( {M_{({1,n})},H_{n}} \right)} = {{P\left( {M_{n}\text{|}H_{n}} \right)}^{*}{\max\limits_{H_{n - 1}}\left\lfloor {{{PF}\left( {H_{n - 1},H_{n}} \right)}^{*}{G\left( {M_{({1,{n - 1}})},H_{n - 1}} \right)}} \right\rfloor}}$and G(M_((1, 1)), H₁) = 0.25^(*)P(M₁|H₁)

The algorithm is as follows:

For n=1: Generate 4 hypotheses H₁i, calculate G(M₁|H₁i) for i=1, . . .,4.For n=2: Generate 4 hypothesis for H₂i, calculate G(M_((1,2))|H₂i),i=1,. . . ,4 in constant time using the formula:

${G\left( {M_{({1,2})},{H_{2}i}} \right)} = {{P\left( M_{2} \middle| {H_{2}i} \right)}*{\max\limits_{{j = 1},\; \ldots \;,4}\left\lbrack {{{PF}\left( {H_{1},{H_{2}i}} \right)}*{G\left( {M_{1},{H_{1}j}} \right)}} \right\rbrack}}$

For n=k: Generate 4 hypothesis for H_(k)i, make G(M_((1,k))|H_(k)i),i=1, . . . ,4 by

${G\left( {M_{{1,k})},{H_{k}i}} \right)} = {{P\left( M_{k} \middle| {H_{k}i} \right)}*{\max\limits_{{j = 1},\; \ldots \;,4}\left\lfloor {{{PF}\left( {H_{k - 1},{H_{k}i}} \right)}*{G\left( {M_{({1,{k - 1}})},{H_{k - 1}j}} \right)}} \right\rfloor}}$

At any time there are only 4 hypotheses to remember and a constantnumber of operations. So the algorithm is linear in k, number of SNPs,as opposed to exponential.

Solving P(M) in Linear Time

It is not necessary to solve for P(M) to get the best hypothesis, sinceit is constant for all H. But in order to get the actual, meaningfulnumber for the conditional probability P(H|M)=P(M|H)*P(H)/P(M), it isalso necessary to derive P(M). As above, we can write:

$\mspace{20mu} \begin{matrix}{{P(M)} = {{P\left( M_{({1,k})} \right)} = {\sum\limits_{H_{({1,k})}}{{P\left( M_{({1,k})} \middle| H_{({1,k})} \right)}*{P\left( H_{({1,k})} \right)}}}}} \\{= {\sum\limits_{H_{k}}{{P\left( M_{K} \middle| H_{k} \right)}{\sum\limits_{H_{({1,{k - 1}})}}{{P\left( M_{({1,{k - 1}})} \middle| H_{({1,{k - 1}})} \right)}*}}}}} \\{{{P\left( H_{({1,{k - 1}})} \right)}*{{PF}\left( {H_{k - 1},H_{k}} \right)}}} \\{= {\sum\limits_{H_{k}}{{P\left( M_{K} \middle| H_{k} \right)}*{W\left( M_{({1,{k - 1}})} \middle| H_{k} \right)}}}}\end{matrix}$   where${W\left( M_{({1,{k - 1}})} \middle| H_{k} \right)} = {\sum\limits_{H_{({1,{k - 1}})}}{{P\left( M_{({1,{k - 1}})} \middle| H_{({1,{k - 1}})} \right)}*{P\left( H_{({1,{k - 1}})} \right)}*{{PF}\left( {H_{k - 1},H_{k}} \right)}}}$

We can solve for W(M,H) by recursion:

$\begin{matrix}{{W\left( M_{({1,{k - 1}})} \middle| H_{k} \right)} = {\sum\limits_{H_{({1,{k - 1}})}}{{P\left( M_{({1,{k - 1}})} \middle| H_{({1,{k - 1}})} \right)}*{P\left( H_{({1,{k - 1}})} \right)}*{{PF}\left( {H_{k - 1},H_{k}} \right)}}}} \\{= {\sum\limits_{H_{k - 1}}{{P\left( M_{k - 1} \middle| H_{k - 1} \right)}{\sum\limits_{H_{({1,{k - 2}})}}{{P\left( M_{({1,{k - 2}})} \middle| H_{({1,{k - 2}})} \right)}*}}}}} \\{{{P\left( H_{({1,{k - 2}})} \right)}*{{PF}\left( {H_{k - 2},H_{k - 1}} \right)}*{{PF}\left( {H_{k - 1},H_{k}} \right)}}} \\{= {\sum\limits_{H_{k - 1}}{{P\left( M_{k - 1} \middle| H_{k - 1} \right)}*{{PF}\left( {H_{k - 1},H_{k}} \right)}*{W\left( M_{({1,{k - 2}})} \middle| H_{k - 1} \right)}}}}\end{matrix}$

so in short, problem of size k is reduced to the problem of size (k−1)by

${W\left( M_{({1,{k - 1}})} \middle| H_{k} \right)} = {\sum\limits_{H_{k - 1}}{{P\left( M_{k - 1} \middle| H_{k - 1} \right)}*{{PF}\left( {H_{k - 1},H_{k}} \right)}*{W\left( M_{({1,{k - 2}})} \middle| H_{k - 1} \right)}}}$$\mspace{20mu} {{{and}\mspace{20mu} {W\left( M_{({1,1})} \middle| H_{2} \right)}} = {\sum\limits_{H_{1}}{{P\left( M_{1} \middle| H_{1} \right)}*0.25*{{PF}\left( {H_{1},H_{2}} \right)}}}}$

As before, for n=2:k, generate W(2), . . . ,W(K)=W(M_((1,k-1))|H_(k))recursively, until finally, it is possible to derive

${P(M)} = {\sum\limits_{H_{k}}{{P\left( M_{K} \middle| H_{k} \right)}*{{W\left( M_{({1,{k - 1}})} \middle| H_{k} \right)}.}}}$

At each level there are only four different hypotheses Hk, so thealgorithm is again linear in the number of SNPs k.

Individual SNP Confidence in Linear Time

Once the best hypothesis H*=(H₁*, . . . ,H_(k)*), has been computed, itnow may be desired to derive the confidence in the final answer for eachSNP, namely P(H_(i)*|M), for i=1, . . . ,k. As beforeP(H_(i)*|M)=P(M|H_(i)*)P(H_(i)*)/P(M)=W(H_(i)*,M)/P(M), where P(M) isalready known.

${{W\left( {M,H_{i}^{*}} \right)} = {{\sum\limits_{H,{H_{i} = H_{i}^{*}}}{{P\left( M \middle| H \right)}*{P(H)}}} = {\sum\limits_{H = {({H_{({1,{i - 1}})},H_{i}^{*},H_{({{i + 1},k})}})}}{{P\left( M \middle| H \right)}*{P(H)}}}}},$

i.e. hypothesis H has been broken up to the hypothesis on first i−1SNPs, ith SNP, and hypothesis on the i+1 to kth SNP. As before:

${P\left( M_{({1,k})} \middle| H_{({1,k})} \right)} = {{\prod\limits_{j = 1}^{k}\; {P\left( M_{j} \middle| H_{j} \right)}} = {{\prod\limits_{j = 1}^{i - 1}\; {{P\left( M_{j} \middle| H_{j} \right)}*{P\left( M_{i} \middle| H_{i}^{*} \right)}*{\prod\limits_{j = {i + 1}}^{k}\; {P\left( M_{j} \middle| H_{j} \right)}}}} = {{P\left( M_{({1,{i - 1}})} \middle| H_{({1,{j - 1}})} \right)}*{P\left( M_{i} \middle| H_{i}^{*} \right)}*{P\left( M_{({{i + 1},k})} \middle| H_{({{i + 1},k})} \right)}\mspace{14mu} {and}}}}$${P\left( H_{({1,k})} \right)} = {{{1/4}*{\prod\limits_{j = 2}^{k}\; {{PF}\left( {H_{j - 1},H_{j}} \right)}}} = {{{1/4}*{\prod\limits_{j = 2}^{i - 1}\; {{{PF}\left( {H_{j - 1},H_{j}} \right)}*{{PF}\left( {H_{i - 1},H_{i}^{*}} \right)}*{{PF}\left( {H_{i - 1},H_{i}^{*}} \right)}*{\prod\limits_{j = {j + 2}}^{k}\; {{PF}\left( {H_{j - 1},H_{j}} \right)}}}}} = {{1/4}*{T\left( H_{({1,{j - 1}})} \right)}*{{PF}\left( {H_{i - 1},H_{i}^{*}} \right)}*{{PF}\left( {H_{i - 1},H_{i}^{*}} \right)}*{T\left( H_{({{i + 1},k})} \right)}}}}$${{So}\mspace{14mu} {P\left( H_{({1,k})} \right)}} = {{{1/4}*{T\left( H_{({1,k})} \right)}} = {{{1/4}*{T\left( H_{({1,{i - 1}})} \right)}*{{PF}\left( {H_{i - 1},H_{i}^{*}} \right)}*{{PF}\left( {H_{i - 1},H_{i}^{*}} \right)}*{T\left( H_{({{i + 1},k})} \right)}\mspace{14mu} {where}\mspace{20mu} {T\left( H_{({1,k})} \right)}} = {\prod\limits_{j = 21}^{k}\; {{{PF}\left( {H_{j - 1},H_{j}} \right)}.}}}}$

From this it is possible to show that

${W\left( {M_{({1,k})},H_{i}^{*}} \right)} = {{\sum\limits_{H,{H_{i} = H_{i}^{*}}}{{P\left( M \middle| H \right)}*{P(H)}}} = {{{\sum\limits_{H,{H_{i} = H_{i}^{*}}}{{P\left( M \middle| H \right)}*{1/4}*{T(H)}{P\left( M_{({1,{i - 1}})} \middle| H_{({1,{i - 1}})} \right)}*{P\left( M_{i} \middle| H_{i}^{*} \right)}*{P\left( M_{({{i + 1},k})} \middle| H_{({{i + 1},k})} \right)}}}*={\sum\limits_{H = {({H_{({1,{i - 1}})},H_{i}^{*},H_{({{i + 1},k})}})}}{{1/4}*{T\left( H_{({1,{i - 1}})} \right)}*{{PF}\left( {H_{i - 1},H_{i}^{*}} \right)}*{{PF}\left( {H_{i - 1},H_{i}^{*}} \right)}*{T\left( H_{({{i + 1},k})} \right)}}}} = {{4*{P\left( M_{i} \middle| H_{i}^{*} \right)}*\left( {\sum\limits_{H_{i - 1}}{{P\left( M_{({1,{i - 1}})} \middle| H_{({1,{i - 1}})} \right)}*{1/4}*{T\left( H_{({1,{i - 1}})} \right)}*{{PF}\left( {H_{i - 1},H_{i}^{*}} \right)}}} \right)*\left( {\sum\limits_{H_{i + 1}}{{P\left( M_{({{i + 1},k})} \middle| H_{({{i + 1},k})} \right)}*{1/4}*{T\left( H_{({{i + 1},k})} \right)}*{{PF}\left( {H_{i}^{*},H_{i + 1}} \right)}}} \right)} = {4*{P\left( M_{i} \middle| H_{i}^{*} \right)}*\left( {\sum\limits_{H_{i - 1}}{{W\left( {M_{({1,{i - 1}})},H_{i - 1}} \right)}*{{PF}\left( {H_{i - 1},H_{i}^{*}} \right)}}} \right)*\left( {\sum\limits_{H_{i + 1}}{{W\left( {M_{({{i +},k})},H_{i + 1}} \right)}*{{PF}\left( {H_{i}^{*},H_{i + 1}} \right)}}} \right)}}}}$

Again a case of size k has been reduced to two pieces of smaller size,albeit a bit more complicated than before. Each of the pieces can becalculated as

${W\left( {M_{({1,n})},H_{n}} \right)} = {{P\left( M_{n} \middle| H_{n} \right)}*\left( {\sum\limits_{H_{n - 1}}{{W\left( {M_{({1,{n - 1}})},H_{n - 1}} \right)}*{{PF}\left( {H_{n - 1},H_{n}} \right)}}} \right)}$${W\left( {M_{({m,k})},H_{m}} \right)} = {{P\left( M_{m} \middle| H_{m} \right)}*\left( {\sum\limits_{H_{m + 1}}{{W\left( {M_{({{m + 1},k})},H_{m + 1}} \right)}*{{PF}\left( {H_{m},H_{m + 1}} \right)}}} \right)}$

So the algorithm will, for n=1, . . . ,k, m=k, . . . 1, for each of 4different H_(n), H_(m) calculate W(M_((1,n)),H_(n)),W(M_((m,k)),H_(m))and then combine them as needed to calculate W(M_((1,k)), H_(i)*), fori=1, . . . ,k. The number of operations is still linear in k.Application of the Disclosed Method to Embryonic Data when a Smaller orDifferent Set of Data is Available

In one embodiment of the system it is only necessary to make use ofdiploid data from one parent (presumably the mother), with or withouthaploid data from either or both of the parents, and when that data isknown to a greater or lesser degree of certainty. For example it isexpected that, given the grueling nature of egg donation, there will beoccasions when maternal haploid data is not readily available. It willbe clear to one skilled in the art, after reading this description, howthe statistical methods for computing the likelihood of a particular SNPcan be modified given a limited data set.

An alternative approach uses data from more distant relatives to make upfor missing diploid or haploid data of one or both parents. For example,since it is known that one set of an individual's chromosomes come fromeach of his or her parents, diploid data from the maternal grandparentscould be used to partially reconstruct missing or poorly measuredmaternal haploid data.

Note the recursive nature of this method: given the naturally noisymeasurement of single cell parental haploid data, along with the diploidand/or haploid data of the appropriate grandparents, the disclosedmethod could be used to clean the parental haploid data, which in turnwill provide more accurate genotyping of the embryo. It should beobvious to one skilled in the arts how to modify the method for use inthese cases.

It is preferable to use more information rather than less, as this canincrease the chances of making the right call at a given SNP, and canincrease the confidence in those calls. This must be balanced with theincreasing complexity of the system as additional techniques and sourcesof data are used. There are many sources of additional information, aswell as techniques available to use the information to augment the data.For example, there are informatics based approaches which take advantageof correlations which can be found in Hapmap data, or other repositoriesof genomic data. In addition there are biological approaches which canallow for the direct measurement of genetic data that otherwise wouldneed to be recreated in silico. For example, haploid data otherwiseunavailable may be measureable by extracting individual chromosomes fromdiploid cells using flow cytometry techniques to isolate fluorescentlytagged chromosomes. Alternately, one may use cell fusion to createmonoallelic hybrid cells to effect diploid to haploid conversion.

Application of the Disclosed Method to Selecting which Embryo is Likelyto Implant

In one embodiment, the system can be used to determine the likelihood ofan embryo to implant in the mother and develop into a baby. To theextent that the likelihood of the embryo implanting is determined bySNPs of the embryo, and/or their relation to SNPs of the mother, thedisclosed method will be important in helping the selection of embryos,based on making a reliable prediction of which will successfully implantbased on the clean SNP data. To best predict the likelihood it will benecessary to take into account the determined genotype of the embryopossibly combined with the levels of gene expression in the embryo, thelevels of gene expression in the mother, and/or the determined genotypeof the mother.

In addition, it is well known that aneuploid embryos are less likely toimplant, less likely to result in a successful pregnancy, and lesslikely to result in a healthy child. Consequently, screening foraneuploides is an important facet to selecting the embryo that is mostlikely to result in a successful outcome. More detail on this approachis given below.

Deducing Parental Haploid Data

In one embodiment of the method, it may be necessary to deduce parentalhaplotypes, given detailed knowledge of the diploid data of a parent.There are multiple ways this can be done. In the simplest case,haplotypes have already been inferred by molecular assay of singlehaploid cells of a direct relation (mother, father, son or daughter). Inthis case, it is a trivial matter to one skilled in the art to deducethe sister haplotype by subtracting the known haplotype from the diploidgenotype measured by molecular assay. For example, if a particular locusis heterozygous, an unknown parental haplotype is the opposite allelefrom the known parental haplotype.

In another case, the noisy haploid data of the parent may be known frommolecular biological haplotyping of individual parental haploid cells,such as a sperm cell, or from individual chromosomes, which may beisolated by various methods including magnetic beads and flow cytometry.In this case, the same procedure can be used as above, except that thedetermined haplotype will be as noisy as the measured haplotype.

There are also methods for deducing haploid data sets directly fromdiploid data, using statistical methods that utilize known haplotypeblocks in the general population (such as those created for the publicHapmap project). A haplotype block is essentially a series of correlatedalleles that occur repeatedly in a variety of populations. Since thesehaplotype blocks are often ancient and common, they may be used topredict haplotypes from diploid genotypes. The parents' inferredhaplotype blocks can then be used as input for the method describedherein to clean the noisy data from the embryos. Publicly availablealgorithms that would accomplish this task include an imperfectphylogeny approach, Bayesian approaches based on conjugate priors, andpriors from population genetics. Some of these algorithms use hiddenMarkov models. One study used public trio and unrelated individual datato demonstrate that these algorithms perform with error rates as low as0.05% across 1 MB of sequence. However, as expected, accuracy is lowerfor individuals with rare haplotype blocks. In one estimate,computational methods failed to phase as many as 5.1% of loci with minorallele frequency of 20%.

In one embodiment of the invention, genetic data from multipleblastomeres taken from different embryos during an IVF cycle is used toinfer the haplotype blocks of the parents with greater reliability.

Techniques for Screening for Aneuploidy Using High and Medium ThroughputGenotyping

In one embodiment of the system the measured genetic data can be used todetect for the presence of aneuploides and/or mosaicism in anindividual. Disclosed herein are several methods of using medium orhigh-throughput genotyping to detect the number of chromosomes or DNAsegment copy number from amplified or unamplified DNA from tissuesamples. The goal is to estimate the reliability that can be achieved indetecting certain types of aneuploidy and levels of mosaicism usingdifferent quantitative and/or qualitative genotyping platforms such asABI TAQMAN, MIPS, or Microarrays from ILLUMINA, AGILENT and AFFMETRIX.In many of these cases, the genetic material is amplified by PCR beforehybridization to probes on the genotyping array to detect the presenceof particular alleles. How these assays are used for genotyping isdescribed elsewhere in this disclosure.

Described below are several methods for screening for abnormal numbersof DNA segments, whether arising from deletions, aneuploides and/ormosaicism. The methods are grouped as follows: (i) quantitativetechniques without making allele calls; (ii) qualitative techniques thatleverage allele calls; (iii) quantitative techniques that leverageallele calls; (iv) techniques that use a probability distributionfunction for the amplification of genetic data at each locus. Allmethods involve the measurement of multiple loci on a given segment of agiven chromosome to determine the number of instances of the givensegment in the genome of the target individual. In addition, the methodsinvolve creating a set of one or more hypotheses about the number ofinstances of the given segment; measuring the amount of genetic data atmultiple loci on the given segment; determining the relative probabilityof each of the hypotheses given the measurements of the targetindividual's genetic data; and using the relative probabilitiesassociated with each hypothesis to determine the number of instances ofthe given segment. Furthermore, the methods all involve creating acombined measurement M that is a computed function of the measurementsof the amounts of genetic data at multiple loci. In all the methods,thresholds are determined for the selection of each hypothesis H_(i)based on the measurement M, and the number of loci to be measured isestimated, in order to have a particular level of false detections ofeach of the hypotheses.

The probability of each hypothesis given the measurement M isP(H_(i)|M)=P(M|H_(i))P(H_(i))/P(M). Since P(M) is independent of H_(i),we can determine the relative probability of the hypothesis given M byconsidering only P(M|H_(i))P(H_(i)). In what follows, in order tosimplify the analysis and the comparison of different techniques, weassume that P(H_(i)) is the same for all {H_(i)}, so that we can computethe relative probability of all the P(H_(i)|M) by considering onlyP(M|H_(i)). Consequently, our determination of thresholds and the numberof loci to be measured is based on having particular probabilities ofselecting false hypotheses under the assumption that P(H_(i)) is thesame for all {H_(i)}. It will be clear to one skilled in the art afterreading this disclosure how the approach would be modified toaccommodate the fact that P(H_(i)) varies for different hypotheses inthe set {H_(i)}. In some embodiments, the thresholds are set so thathypothesis H_(i), is selected which maximizes P(H_(i)|M) over all i.However, thresholds need not necessarily be set to maximize P(H_(i)|M),but rather to achieve a particular ratio of the probability of falsedetections between the different hypotheses in the set {H_(i)}.

It is important to note that the techniques referred to herein fordetecting aneuploides can be equally well used to detect for uniparentaldisomy, unbalanced translocations, and for the sexing of the chromosome(male or female; XY or XX). All of the concepts concern detecting theidentity and number of chromosomes (or segments of chromosomes) presentin a given sample, and thus are all addressed by the methods describedin this document. It should be obvious to one skilled in the art how toextend any of the methods described herein to detect for any of theseabnormalities.

The Concept of Matched Filtering

The methods applied here are similar to those applied in optimaldetection of digital signals. It can be shown using the Schwartzinequality that the optimal approach to maximizing Signal to Noise Ratio(SNR) in the presence of normally distributed noise is to build anidealized matching signal, or matched filter, corresponding to each ofthe possible noise-free signals, and to correlate this matched signalwith the received noisy signal. This approach requires that the set ofpossible signals are known as well as the statistical distribution—meanand Standard Deviation (SD)—of the noise. Herein is described thegeneral approach to detecting whether chromosomes, or segments of DNA,are present or absent in a sample. No differentiation will be madebetween looking for whole chromosomes or looking for chromosome segmentsthat have been inserted or deleted. Both will be referred to as DNAsegments. It should be clear after reading this description how thetechniques may be extended to many scenarios of aneuploidy and sexdetermination, or detecting insertions and deletions in the chromosomesof embryos, fetuses or born children. This approach can be applied to awide range of quantitative and qualitative genotyping platformsincluding TAQMAN, qPCR, ILLUMINA Arrays, AFFMETRIX Arrays, AGILENTArrays, the MIPS kit etc.

Formulation of the General Problem

Assume that there are probes at SNPs where two allelic variations occur,x and y. At each locus i, i=1 . . . N, data is collected correspondingto the amount of genetic material from the two alleles. In the TAQMANassay, these measures would be, for example, the cycle time, C_(t), atwhich the level of each allele-specific dye crosses a threshold. It willbe clear how this approach can be extended to different measurements ofthe amount of genetic material at each locus or corresponding to eachallele at a locus. Quantitative measurements of the amount of geneticmaterial may be nonlinear, in which case the change in the measurementof a particular locus caused by the presence of the segment of interestwill depend on how many other copies of that locus exist in the samplefrom other DNA segments. In some cases, a technique may require linearmeasurements, such that the change in the measurement of a particularlocus caused by the presence of the segment of interest will not dependon how many other copies of that locus exist in the sample from otherDNA segments. An approach is described for how the measurements from theTAQMAN or qPCR assays may be linearized, but there are many othertechniques for linearizing nonlinear measurements that may be appliedfor different assays.

The measurements of the amount of genetic material of allele x at loci 1. . . N is given by data d_(x)=[d_(x1) . . . d_(xN)]. Similarly forallele y, d_(y)=[d_(y1) . . . d_(yN)]. Assume that each segment j hasalleles a_(j)=[a_(j1) . . . a_(jN)] where each element a_(ji) is eitherx or y. Describe the measurement data of the amount of genetic materialof allele x as d_(x)=s_(x)+ν_(x) where s, is the signal and ν_(x) is adisturbance. The signal s_(x)=[f_(x)(a₁₁, . . . ,a_(JI)) . . .f_(x)(a_(JN), . . . , a_(JN))] where f_(x) is the mapping from the setof alleles to the measurement, and J is the number of DNA segmentcopies. The disturbance vector ν_(x) is caused by measurement error and,in the case of nonlinear measurements, the presence of other geneticmaterial besides the DNA segment of interest. Assume that measurementerrors are normally distributed and that they are large relative todisturbances caused by nonlinearity (see section on linearizingmeasurements) so that u_(xi)≈n_(xi) where n_(xi) has variance σ_(xi) ²and vector n_(x) is normally distributed ˜N(0,R), R=E(n_(x),n_(x) ^(T)).Now, assume some filter h is applied to this data to perform themeasurement m_(x)=h^(T)d_(x)=h^(T)s_(x)+h^(T)ν_(x). In order to maximizethe ratio of signal to noise (h^(T)s_(x)/h^(T)n_(x)) it can be shownthat h is given by the matched filter h=μR⁻¹s_(x) where μ is a scalingconstant. The discussion for allele x can be repeated for allele y.

Method 1a: Measuring Aneuploidy or Sex by Quantitative Techniques thatdo not Make Allele Calls when the Mean and Standard Deviation for EachLocus is Known

Assume for this section that the data relates to the amount of geneticmaterial at a locus irrespective of allele value (e.g. using qPCR), orthe data is only for alleles that have 100% penetrance in thepopulation, or that data is combined on multiple alleles at each locus(see section on linearizing measurements) to measure the amount ofgenetic material at that locus. Consequently, in this section one mayrefer to data d_(x) and ignore d_(y). Assume also that there are twohypotheses: h₀ that there are two copies of the DNA segment (these aretypically not identical copies), and h₁ that there is only 1 copy. Foreach hypothesis, the data may be described asd_(xi)(h₀)=s_(xi)(h₀)+n_(xi) and d_(xi)(h₁)=s_(xi)(h₁)+n_(xi)respectively, where s_(xi)(h₀) is the expected measurement of thegenetic material at locus i (the expected signal) when two DNA segmentsare present and s_(xi)(h₁) is the expected data for one segment.Construct the measurement for each locus by differencing out theexpected signal for hypothesis h₀: m_(xi)=d_(xi)−s_(xi)(h₀). If h₁ istrue, then the expected value of the measurement isE(m_(xi))=s_(xi)(h₁)−s_(xi)(h₀). Using the matched filter conceptdiscussed above, set h=(1/N)R⁻¹(s_(xi)(h₁)−s_(xi)(h₀)). The measurementis described asm=h^(T)d_(x)=(1/N)Σ_(i=1 . . . N)((s_(xi)(h₁)−s_(xi)(h₀))/σ_(xi)²)m_(xi).

If h₁ is true, the expected value ofE(mlh₁)=m₁=(1/N)Σ_(i=1 . . . N)(s_(xi)(h₁)−s_(xi)(h₀))²/σ_(xi) ² and thestandard deviation of m is σ_(mlh1)²=(1/N²)Σ_(i=1 . . . N)((s_(xi)(h₁)−s_(xi)(h₀))²/σ_(xi) ⁴)σ_(xi)²=(1/N²)Σ_(i=1 . . . N)(s_(xi)(h₁)−s_(xi)(h₀))²/σ_(xi) ².

If h₀ is true, the expected value of m is E(mlh₀)=m₀=0 and the standarddeviation of m is again σmlh₀²=(1/N²)Σ_(i=1 . . . N)(s_(xi)(h₁)−s_(xi)(h₀))²/σ_(xi) ².

FIG. 3 illustrates how to determine the probability of false negativesand false positive detections. Assume that a threshold t is set half-waybetween m₁ and m₀ in order to make the probability of false negativesand false positives equal (this need not be the case as is describedbelow). The probability of a false negative is determined by the ratioof (m₁−t)/σmlh₁=(m₁−m₀)/(2σ_(mlh1)). “5-Sigma” statistics may be used sothat the probability of false negatives is 1-normcdf(5,0,1)=2.87e−7. Inthis case, the goal is for (m₁−m₀)/(2σ_(mlh0))>5 or 10sqrt((1/N²)Σ_(i=1 . . . N)(s_(xi)(h₁)−s_(xi)(h₀))²/σ_(xi)²)<(1/N)Σ_(i=1 . . . N)(s_(xi)(h₁)−s_(xi)(h₀))2/σ_(xi) ² orsqrt(Σ_(i=1 . . . N)(s_(xi)(h₁)−s_(xi)(h₀))²/σ_(xi) ²)>10. In order tocompute the size of N, Mean Signal to Noise Ratio can be computed fromaggregated data:MSNR=(1/N)Σ_(i=1 . . . N)(s_(xi)(h₁)−s_(xi)(h₀))²/σ_(xi) ². N can thenbe found from the inequality above: sqrt(N)·sqrt(MSNR)>10 or N>100/MSNR.

This approach was applied to data measured with the TAQMAN Assay fromApplied BioSystems using 48 SNPs on the X chromosome. The measurementfor each locus is the time, C_(t), that it takes the die released in thewell corresponding to this locus to exceed a threshold. Sample 0consists of roughly 0.3 ng (50 cells) of total DNA per well of mixedfemale origin where subjects had two X chromosomes; sample 1 consistedof roughly 0.3 ng of DNA per well of mixed male origin where subject hadone X chromosome. FIG. 4 and FIG. 5 show the histograms of measurementsfor samples 1 and 0. The distributions for these samples arecharacterized by m₀=29.97; SD₀=1.32, m₁=31.44, SD₁=1.592. Since thisdata is derived from mixed male and female samples, some of the observedSD is due to the different allele frequencies at each SNP in the mixedsamples. In addition, some of the observed SD will be due to the varyingefficiency of the different assays at each SNP, and the differing amountof dye pipetted into each well. FIG. 6 provides a histogram of thedifference in the measurements at each locus for the male and femalesample. The mean difference between the male and female samples is 1.47and the SD of the difference is 0.99. While this SD will still besubject to the different allele frequencies in the mixed male and femalesamples, it will no longer be affected the different efficiencies ofeach assay at each locus. Since the goal is to differentiate twomeasurements each with a roughly similar SD, the adjusted SD may beapproximated for each measurement for all loci as 0.99/sqrt(2)=0.70. Tworuns were conducted for every locus in order to estimate G_(xi) for theassay at that locus so that a matched filter could be applied. A lowerlimit of σ_(xi) was set at 0.2 in order to avoid statistical anomaliesresulting from only two runs to compute σ_(xi). Only those loci(numbering 37) for which there were no allele dropouts over bothalleles, over both experiment runs and over both male and female sampleswere used in the plots and calculations. Applying the approach above tothis data, it was found that MSNR=2.26, hence N=2²5²/2.26{circumflexover ( )}=17 loci.

Method 1b: Measuring Aneuploidy or Sex by Quantitative Techniques thatdo not Make Allele Calls when the Mean and Std. Deviation is not Knownor is Uniform

When the characteristics of each locus are not known well, thesimplifying assumptions that all the assays at each locus will behavesimilarly can be made, namely that E(m_(xi)) and G_(xi) are constantacross all loci i, so that it is possible to refer instead only toE(m_(x)) and σ_(x). In this case, the matched filtering approachm=h^(T)d_(x) reduces to finding the mean of the distribution of d_(x).This approach will be referred to as comparison of means, and it will beused to estimate the number of loci required for different kinds ofdetection using real data.

As above, consider the scenario when there are two chromosomes presentin the sample (hypothesis h₀) or one chromosome present (h₁). For h₀,the distribution is N(μ₀,σ₀ ²) and for h₁ the distribution is N(μ₁, σ₁²). Measure each of the distributions using N₀ and N₁ samplesrespectively, with measured sample means and SDs m₁, m₀, s₁, and s₀. Themeans can be modeled as random variables M₀, M_(i) that are normallydistributed as M₀˜N(μ₀, σ₀ ²/N₀) and M₁˜N(μ₁, σ₁ ²/N₁). Assume N₁ and N₀are large enough (>30) so that one can assume that M₁˜N(m₁, s₁ ²/N₁) andM₀˜N(m₀, s₀ ²/N₀). In order to test whether the distributions aredifferent, the difference of the means test may be used, where d=m₁−m₀.The variance of the random variable D is σ_(d) ²=σ₁ ²/N₁+σ₀ ²/N₀ whichmay be approximated as σ_(d) ²=s₁ ²/N₁+s₀ ²/N₀. Given h₀, E(d)=0; givenh₁, E(d)=μ₁−μ₀. Different techniques for making the call between h₁ forh₀ will now be discussed.

Data measured with a different run of the TAQMAN Assay using 48 SNPs onthe X chromosome was used to calibrate performance. Sample 1 consists ofroughly 0.3 ng of DNA per well of mixed male origin containing one Xchromosome; sample 0 consisted of roughly 0.3 ng of DNA per well ofmixed female origin containing two X chromosomes. N₁=42 and N₀=45. FIG.7 and FIG. 8 show the histograms for samples 1 and 0. The distributionsfor these samples are characterized by m₁=32.259, s₁=1.460,σ_(m1)=s₁/sqrt(N₁)=0.225; m₀=30.75; s₀=1.202, σ_(m0)=s₀/sqrt(N₀)=0.179.For these samples d=1.509 and σ_(d)=0.2879.

Since this data is derived from mixed male and female samples, much ofthe standard deviation is due to the different allele frequencies ateach SNP in the mixed samples. SD is estimated by considering thevariations in C_(t) for one SNP at a time, over multiple runs. This datais shown in FIG. 9. The histogram is symmetric around 0 since C_(t) foreach SNP is measured in two runs or experiments and the mean value of Ctfor each SNP is subtracted out. The average std. dev. across 20 SNPs inthe mixed male sample using two runs is s=0.597. This SD will beconservatively used for both male and female samples, since SD for thefemale sample will be smaller than for the male sample. In addition,note that the measurement from only one dye is being used, since themixed samples are assumed to be heterozygous for all SNPs. The use ofboth dyes requires the measurements of each allele at a locus to becombined, which is more complicated (see section on linearizingmeasurements). Combining measurements on both dyes would double signalamplitude and increase noise amplitude by roughly sqrt(2), resulting inan SNR improvement of roughly sqrt(2) or 3 dB.

Detection Assuming No Mosaicism and No Reference Sample

Assume that m₀ is known perfectly from many experiments, and everyexperiment runs only one sample to compute m₁ to compare with m₀. N₁ isthe number of assays and assume that each assay is a different SNPlocus. A threshold t can be set half way between m_(o) and m_(i) to makethe likelihood of false positives equal the number of false negatives,and a sample is labeled abnormal if it is above the threshold. Assumes₁=s₂=s=0.597 and use the 5-sigma approach so that the probability offalse negatives or positives is 1−normcdf(5,0,1)=2.87e−7. The goal isfor 5s₁/sqrt(N₁)<(m₁−m₀)/2, hence N₁=100 s₁ ²/(m₁−m₀)²=16. Now, anapproach where the probability of a false positive is allowed to behigher than the probability of a false negatives, which is the harmfulscenario, may also be used. If a positive is measured, the experimentmay be rerun. Consequently, it is possible to say that the probabilityof a false negative should be equal to the square of the probability ofa false positive. Consider FIG. 3, let t=threshold, and assumeSigma_0=Sigma_1=s. Thus(1-normcdf((t−m₀)/s,0,1))²=1−normcdf((m₁−t)/s,0,1). Solving this, it canbe shown that t=m₀+0.32(m₁−m₀). Hence the goal is for5s/sqrt(N₁)<m₁−m₀−0.32(m₁−m₀)=(m₁−m₀)/1.47, henceN₁=(5²)(1.47²)s²/(m₁−m₀)²=9.

Detection with Mosaicism without Running a Reference Sample

Assume the same situation as above, except that the goal is to detectmosaicism with a probability of 97.7% (i.e. 2-sigma approach). This isbetter than the standard approach to amniocentesis which extractsroughly 20 cells and photographs them. If one assumes that 1 in 20 cellsis aneuploid and this is detected with 100% reliability, the probabilityof having at least one of the group being aneuploid using the standardapproach is 1−0.95²0=64%. If 0.05% of the cells are aneuploid (call thissample 3) then m₃=0.95m₀+0.05m₁ and var(m₃)=(0.95s₀ ²+0.05s₁ ²)/N₁. Thusstd(m₃)2<(m₃−m₀)/2=>sqrt(0.95s₀²+0.05s²)/sqrt(N₁)<0.05(m₁−m₂)/4=>N₁=16(0.95s₂ ²+0.05s₁²)/(0.05²(m₁-m₂)²)=1001. Note that using the goal of 1-sigma statistics,which is still better than can be achieved using the conventionalapproach (i.e. detection with 84.1% probability), it can be shown in asimilar manner that N₁=250.

Detection with No Mosaicism and Using a Reference Sample

Although this approach may not be necessary, assume that everyexperiment runs two samples in order to compare m_(i) with truth samplem₂. Assume that N=N₁=N₀. Compute d=m₁-m₀ and, assuming σ₁=σ₀, set athreshold t=(m₀+m₁)/2 so that the probability of false positives andfalse negatives is equal. To make the probability of false negatives2.87e−7, it must be the case that (m₁−m₂)/2>5 sqrt(s₁ ²/N+s₂²/N)=>N=100(s₁ ²+s₂ ²)/(m₁−m₂)²=32.

Detection with Mosaicism and Running a Reference Sample

As above, assume the probability of false negatives is 2.3% (i.e.2-sigma approach). If 0.05% of the cells are aneuploid (call this sample3) then m₃=0.95m₀+0.05m₁ and var(m₃)=(0.95s₀ ²+0.05s₁ ²)/N₁. d=m₃−m₂ andσ_(d) ²=(1.95s₀ ²+0.05s₁ ²)/N. It must be thatstd(m₃)2<(m₀−m₂)/2=>sqrt(1.95s₂ ²+0.05s₁²)/sqrt(N)<0.05(m₁−m₂)/4=>N=16(1.95s₂ ²+0.05s₁ ²)/(0.05²(m₁−m₂)²)=2002.Again using 1-sigma approach, it can be shown in a similar manner thatN=500.

Consider the case if the goal is only to detect 5% mosaicism with aprobability of 64% as is the current state of the art. Then, theprobability of false negative would be 36%. In other words, it would benecessary to find x such that 1-normcdf(x,0,1)=36%. ThusN=4(0.36{circumflex over ( )}2)(1.95s₂ ²+0.05s₁ ²)/(0.05²(m₁−m₂)²)=65for the 2-sigma approach, or N=33 for the 1-sigma approach. Note thatthis would result in a very high level of false positives, which needsto be addressed, since such a level of false positives is not currentlya viable alternative.

Also note that if N is limited to 384 (i.e. one 384 well TAQMAN plateper chromosome), and the goal is to detect mosaicism with a probabilityof 97.72%, then it will be possible to detect mosaicism of 8.1% usingthe 1-sigma approach. In order to detect mosaicism with a probability of84.1% (or with a 15.9% false negative rate), then it will be possible todetect mosaicism of 5.8% using the 1-sigma approach. To detect mosaicismof 19% with a confidence of 97.72% it would require roughly 70 loci.Thus one could screen for 5 chromosomes on a single plate.

The summary of each of these different scenarios is provided in FIG. 20.Also included in FIG. 20 are the results generated from qPCR and theSYBR assays. The methods described above were used and the simplifyingassumption was made that the performance of the qPCR assay for eachlocus is the same. FIG. 10 and FIG. 11 show the histograms for samples 1and 0, as described above. N₀=N₁=47. The distributions of themeasurements for these samples are characterized by m₁=27.65, s₁=1.40,σ_(m1)=s₁/sqrt(N₁)=0.204; m₀=26.64; s₀=1.146, σ_(m0)=s₀/sqrt(N₀)=0.167.For these samples d=1.01 and σ_(d)=0.2636. FIG. 12 shows the differencebetween C_(t) for the male and female samples for each locus, with astandard deviation of the difference over all loci of 0.75. The SD wasapproximated for each measurement of each locus on the male or femalesample as 0.75/sqrt(2)=0.53.

Method 2: Qualitative Techniques that Use Allele Calls

In this section, no assumption is made that the assay is quantitative.Instead, the assumption is that the allele calls are qualitative, andthat there is no meaningful quantitative data coming from the assays.This approach is suitable for any assay that makes an allele call. FIG.13 describes how different haploid gametes form during meiosis, and willbe used to describe the different kinds of aneuploidy that are relevantfor this section. The best algorithm depends on the type of aneuploidythat is being detected.

Consider a situation where aneuploidy is caused by a third segment thathas no section that is a copy of either of the other two segments. FromFIG. 13, the situation would arise, for example, if p₁ and p₄, or p₂ andp₃, both arose in the child cell in addition to one segment from theother parent. This is very common, given the mechanism which causesaneuploidy. One approach is to start off with a hypothesis h₀ that thereare two segments in the cell and what these two segments are. Assume,for the purpose of illustration, that h₀ is for p₃ and m₄ from FIG. 13.In a preferred embodiment this hypothesis comes from algorithmsdescribed elsewhere in this document. Hypothesis h₁ is that there is anadditional segment that has no sections that are a copy of the othersegments. This would arise, for example, if p₂ or m_(i) was alsopresent. It is possible to identify all loci that are homozygous in p₃and m₄. Aneuploidy can be detected by searching for heterozygousgenotype calls at loci that are expected to be homozygous.

Assume every locus has two possible alleles, x and y. Let theprobability of alleles x and y in general be p_(x) and p_(y)respectively, and p_(x)+p_(y)=1. If h₁ is true, then for each locus ifor which p₃ and m₄ are homozygous, then the probability of anon-homozygous call is p_(y) or p_(x), depending on whether the locus ishomozygous in x or y respectively. Note: based on knowledge of theparent data, i.e. p₁, p₂, p₄ and m_(i), m₂, m₃, it is possible tofurther refine the probabilities for having non-homozygous alleles x ory at each locus. This will enable more reliable measurements for eachhypothesis with the same number of SNPs, but complicates notation, sothis extension will not be explicitly dealt with. It should be clear tosomeone skilled in the art how to use this information to increase thereliability of the hypothesis.

The probability of allele dropouts is p_(d). The probability of findinga heterozygous genotype at locus i is poi given hypothesis h₀ and p_(1i)given hypothesis h₁.

Given h₀: p_(0i)=0

Given h₁: p_(1i)=p_(x)(1−p_(d)) or p_(1i)=p_(y)(1−p_(d)) depending onwhether the locus is homozygous for x or y.

Create a measurement m=1/N_(h) Σ_(i=1 . . . Nh) I_(i) where I_(i) is anindicator variable, and is 1 if a heterozygous call is made and 0otherwise. N_(h) is the number of homozygous loci. One can simplify theexplanation by assuming that p_(x)=p_(y) and p_(0i), p_(1i) for all lociare the same two values p₀ and p₁. Given h₀, E(m)=p₀=0 and σ²_(mlh0)=p₀(1−p₀)/N_(h). Given h₁, E(m)=p₁ and σ² _(mlh1)=p₁(1−p₁)/N_(h).Using 5 sigma-statistics, and making the probability of false positivesequal the probability of false negatives, it can be shown that(p₁−p₀)/2>5σ_(mlh1) hence N_(h)=100(p₀(1−p₀)+p₁(1−p₁))/(p₁−p₀)². For2-sigma confidence instead of 5-sigma confidence, it can be shown thatN_(h)=4.2²(p₀(1−p₀)+p₁(1−p₁))/(p₁−p₀)².

It is necessary to sample enough loci N that there will be sufficientavailable homozygous loci N_(h-avail) such that the confidence is atleast 97.7% (2-sigma). Characterize N_(h-avail)=Σ_(i=1 . . . N) J_(i)where J_(i) is an indicator variable of value 1 if the locus ishomozygous and 0 otherwise. The probability of the locus beinghomozygous is p_(x) ²+p_(y) ². Consequently, E(N_(h-avail))=N(p_(x)²+p_(y) ²) and σ_(Nh-avail) ²=N(p_(x) ²+p_(y) ²)(1-p_(x) ²−p_(y) ²). Toguarantee N is large enough with 97.7% confidence, it must be thatE(N_(h-avail))−²σN_(h-avail)=N_(h) where N_(h) is found from above.

For example, if one assumes p_(d)=0.3, p_(x)=p_(y)=0.5, one can findN_(h)=186 and N=391 for 5-sigma confidence. Similarly, it is possible toshow that N_(h)=30 and N=68 for 2-sigma confidence i.e. 97.7% confidencein false negatives and false positives.

Note that a similar approach can be applied to looking for deletions ofa segment when h₀ is the hypothesis that two known chromosome segmentare present, and h₁ is the hypothesis that one of the chromosomesegments is missing. For example, it is possible to look for all ofthose loci that should be heterozygous but are homozygous, factoring inthe effects of allele dropouts as has been done above.

Also note that even though the assay is qualitative, allele dropoutrates may be used to provide a type of quantitative measure on thenumber of DNA segments present.

Method 3: Making Use of Known Alleles of Reference Sequences, andQuantitative Allele Measurements

Here, it is assumed that the alleles of the normal or expected set ofsegments are known. In order to check for three chromosomes, the firststep is to clean the data, assuming two of each chromosome. In apreferred embodiment of the invention, the data cleaning in the firststep is done using methods described elsewhere in this document. Thenthe signal associated with the expected two segments is subtracted fromthe measured data. One can then look for an additional segment in theremaining signal. A matched filtering approach is used, and the signalcharacterizing the additional segment is based on each of the segmentsthat are believed to be present, as well as their complementarychromosomes. For example, considering FIG. 13, if the results of PSindicate that segments p2 and m1 are present, the technique describedhere may be used to check for the presence of p2, p3, m1 and m4 on theadditional chromosome. If there is an additional segment present, it isguaranteed to have more than 50% of the alleles in common with at leastone of these test signals. Note that another approach, not described indetail here, is to use an algorithm described elsewhere in the documentto clean the data, assuming an abnormal number of chromosomes, namely 1,3, 4 and 5 chromosomes, and then to apply the method discussed here. Thedetails of this approach should be clear to someone skilled in the artafter having read this document.

Hypothesis h₀ is that there are two chromosomes with allele vectors a₁,a₂. Hypothesis h₁ is that there is a third chromosome with allele vectora₃. Using a method described in this document to clean the genetic data,or another technique, it is possible to determine the alleles of the twosegments expected by h₀: a₁=[a₁₁ . . . a_(1N)] and a₂=[a₂₁ . . . a_(2N)]where each element a_(ji) is either x or y. The expected signal iscreated for hypothesis h₀: s_(0x)=[f_(0x)(a₁₁, a₂₁) . . . f_(x0)(a_(1N),a_(2N))], s_(0y)=[f_(y)(a₁₁, a₂₁) . . . f_(y)(a_(1N), a_(2N))] wheref_(x), f_(y) describe the mapping from the set of alleles to themeasurements of each allele. Given h₀, the data may be described asd_(xi)=s_(0xi)+n_(xi), n_(xi)˜N(0, σ_(xi) ²); d_(yi)=S_(0yi)+n_(yi),n_(yi)˜N(0, σ_(yi) ²). Create a measurement by differencing the data andthe reference signal: m_(xi)=d_(xi)−s_(xi); m_(yi)=d_(yi)−s_(yi). Thefull measurement vector is m=[m_(x) ^(T) m_(y) ^(T)]^(T).

Now, create the signal for the segment of interest—the segment whosepresence is suspected, and will be sought in the residual—based on theassumed alleles of this segment: a₃=[a₃₁ . . . a_(3N)]. Describe thesignal for the residual as: s_(r)=[s_(rx) ^(T) s_(ry) ^(T)]^(T) wheres_(rx)=[f_(rx)(a₃₁) . . . f_(rx)(a_(3N))], s_(ry)=[f_(ry)(a₃₁) . . .f_(ry)(a_(3N))] where f_(rx)(a_(3i))=δ_(xi) if a_(3i)=x and 0 otherwise,f_(ry)(a_(3i))=δ_(yi) if a_(3i)=y and 0 otherwise. This analysis assumesthat the measurements have been linearized (see section below) so thatthe presence of one copy of allele x at locus i generates dataδ_(xi)+n_(xi) and the presence of κ_(x) copies of the allele x at locusi generates data κ_(x)δ_(xi)+n_(xi). Note however that this assumptionis not necessary for the general approach described here. Given h₁, ifallele a_(3i)=x then m_(xi)=δ_(xi)+n_(xi), m_(yi)=n_(yi) and if a_(3i)=ythen m_(xi)=n_(xi), m_(yi)=δ_(yi)+n_(yi). Consequently, a matched filterh=(1/N)R⁻¹s_(r) can be created where R=diag([σ_(x1) ² . . . σ_(xN) ²σ_(y1) ² . . . σ_(yN) ²]). The measurement is m=h^(T)d.

h ₀ : m=(1/N)Σ_(i=1 . . . N) s _(rxi) n _(xi)/σ_(xi) ² +s _(ryi) n_(yi)/σ_(yi) ²

h ₁ : m=(1/N)Σ_(i=1 . . . N) s _(rxi)(δ_(xi) +n _(xi))/σ_(xi) ² +s_(ryi)(δ_(yi) +n _(yi))/σ_(yi) ²

In order to estimate the number of SNPs required, make the simplifyingassumptions that all assays for all alleles and all loci have similarcharacteristics, namely that δ_(xi)=σ_(yi)=δ and σ_(xi)=σ_(yi)=σ for i=1. . . N. Then, the mean and standard deviation may be found as follows:

h ₀ : E(m)=m ₀=0;σ_(mlh0)2=(1/N ²σ⁴)(N/2)(σ²δ²+σ²δ²)=δ²/(Nσ ²)

h ₁ : E(m)=m ₁=(1/N)(N/2σ²)(δ²+δ²)=δ²/σ²;σ_(mlh1) ²=(1/N²σ⁴)(N)(σ²δ²)=δ²/(Nσ ²)

Now compute a signal-to-noise ratio (SNR) for this test of h₁ versus h₀.The signal is m₁−m₀=δ²/σ², and the noise variance of this measurement isσ_(mlh0) ²+σ_(mlh1) ²=2δ²/(Nσ²). Consequently, the SNR for this test is(δ⁴/δ⁴)/(2δ²/(Nσ²))=Nδ²/(2σ²).

Compare this SNR to the scenario where the genetic information is simplysummed at each locus without performing a matched filtering based on theallele calls. Assume that h=(1/N)i where i is the vector of N ones, andmake the simplifying assumptions as above that δ_(xi)=δ_(yi)=δ andσ_(xi)=σ_(yi)=σ for i=1 . . . N. For this scenario, it isstraightforward to show that if m=h^(T)d:

h ₀ : E(m)=m ₀=0;σ_(mlh0) ² =Nσ ² /N ² +Nσ ² /N ²=2σ² /N

h ₁ : E(m)=m ₁=(1/N)(Nδ/2+Nδ/2)=δ;σ_(mlh1) ²=(1/N ²)(Nσ ² +Nσ ²)=2σ² /N

Consequently, the SNR for this test is Nδ²/(4σ²). In other words, byusing a matched filter that only sums the allele measurements that areexpected for segment a₃, the number of SNPs required is reduced by afactor of 2. This ignores the SNR gain achieved by using matchedfiltering to account for the different efficiencies of the assays ateach locus.

Note that if we do not correctly characterize the reference signalss_(xi) and s_(yi) then the SD of the noise or disturbance on theresulting measurement signals m_(xi) and m_(yi) will be increased. Thiswill be insignificant if δ<<σ, but otherwise it will increase theprobability of false detections. Consequently, this technique is wellsuited to test the hypothesis where three segments are present and twosegments are assumed to be exact copies of each other. In this case,s_(xi) and s_(yi) will be reliably known using techniques of datacleaning based on qualitative allele calls described elsewhere. In oneembodiment method 3 is used in combination with method 2 which usesqualitative genotyping and, aside from the quantitative measurementsfrom allele dropouts, is not able to detect the presence of a secondexact copy of a segment.

We now describe another quantitative technique that makes use of allelecalls. The method involves comparing the relative amount of signal ateach of the four registers for a given allele. One can imagine that inthe idealized case involving a single, normal cell, where homogenousamplification occurs, (or the relative amounts of amplification arenormalized), four possible situations can occur: (i) in the case of aheterozygous allele, the relative intensities of the four registers willbe approximately 1:1:0:0, and the absolute intensity of the signal willcorrespond to one base pair; (ii) in the case of a homozygous allele,the relative intensities will be approximately 1:0:0:0, and the absoluteintensity of the signal will correspond to two base pairs; (iii) in thecase of an allele where ADO occurs for one of the alleles, the relativeintensities will be approximately 1:0:0:0, and the absolute intensity ofthe signal will correspond to one base pair; and (iv) in the case of anallele where ADO occurs for both of the alleles, the relativeintensities will be approximately 0:0:0:0, and the absolute intensity ofthe signal will correspond to no base pairs.

In the case of aneuploides, however, different situations will beobserved. For example, in the case of trisomy, and there is no ADO, oneof three situations will occur: (i) in the case of a triply heterozygousallele, the relative intensities of the four registers will beapproximately 1:1:1:0, and the absolute intensity of the signal willcorrespond to one base pair; (ii) in the case where two of the allelesare homozygous, the relative intensities will be approximately 2:1:0:0,and the absolute intensity of the signal will correspond to two and onebase pairs, respectively; (iii) in the case where are alleles arehomozygous, the relative intensities will be approximately 1:0:0:0, andthe absolute intensity of the signal will correspond to three basepairs. If allele dropout occurs in the case of an allele in a cell withtrisomy, one of the situations expected for a normal cell will beobserved. In the case of monosomy, the relative intensities of the fourregisters will be approximately 1:0:0:0, and the absolute intensity ofthe signal will correspond to one base pair. This situation correspondsto the case of a normal cell where ADO of one of the alleles hasoccurred, however in the case of the normal cell, this will only beobserved at a small percentage of the alleles. In the case ofuniparental disomy, where two identical chromosomes are present, therelative intensities of the four registers will be approximately1:0:0:0, and the absolute intensity of the signal will correspond to twobase pairs. In the case of UPD where two different chromosomes from oneparent are present, this method will indicate that the cell is normal,although further analysis of the data using other methods described inthis patent will uncover this.

In all of these cases, either in cells that are normal, have aneuploidesor UPD, the data from one SNP will not be adequate to make a decisionabout the state of the cell. However, if the probabilities of each ofthe above hypothesis are calculated, and those probabilities arecombined for a sufficient number of SNPs on a given chromosome, onehypothesis will predominate, it will be possible to determine the stateof the chromosome with high confidence.

Methods for Linearizing Quantitative Measurements

Many approaches may be taken to linearize measurements of the amount ofgenetic material at a specific locus s_(o) that data from differentalleles can be easily summed or differenced. We first discuss a genericapproach and then discuss an approach that is designed for a particulartype of assay.

Assume data d_(xi) refers to a nonlinear measurement of the amount ofgenetic material of allele x at locus i. Create a training set of datausing N measurements, where for each measurement, it is estimated orknown that the amount of genetic material corresponding to data d_(xi)is δ_(xi). The training set δ_(xi), i=1 . . . N, is chosen to span allthe different amounts of genetic material that might be encountered inpractice. Standard regression techniques can be used to train a functionthat maps from the nonlinear measurement, d_(xi), to the expectation ofthe linear measurement, E(β_(xi)). For example, a linear regression canbe used to train a polynomial function of order P, such thatE(β_(xi))=[1 d_(xi) d_(xi) ² . . . d_(xi) ^(P)]c where c is the vectorof coefficients c=[c₀ c₁ . . . c_(P)]^(T). To train this linearizingfunction, we create a vector of the amount of genetic material for Nmeasurements β=[β_(x1) . . . β_(xN)]^(T) and a matrix of the measureddata raised to powers 0 . . . P: D=[[1 d_(x1) d_(x1) ² . . . d_(x1)^(P)]^(T) [1 d_(x2) d_(x2) ² . . . d_(x2) ^(P)]^(T) . . . [1 d_(xN)d_(xN) ² . . . d_(xN) ^(P)]^(T)]^(T). The coefficients can then be foundusing a least squares fit c=(D^(T)D)⁻¹D^(T)β_(x).

Rather than depend on generic functions such as fitted polynomials, wemay also create specialized functions for the characteristics of aparticular assay. We consider, for example, the TAQMAN assay or a qPCRassay. The amount of die for allele x and some locus i, as a function oftime up to the point where it crosses some threshold, may be describedas an exponential curve with a bias offset: g_(xi)(t)=α_(xi)+β_(xi)exp(γ_(xi) t) where α_(xi) is the bias offset, γ_(xi) is the exponentialgrowth rate, and β_(xi) corresponds to the amount of genetic material.To cast the measurements in terms of β_(xi), compute the parameterα_(xi) by looking at the asymptotic limit of the curve g_(xi)(−∞) andthen may find β_(xi) and y_(xi) by taking the log of the curve to obtainlog(g_(xi)(t)−α_(xi))=log(β_(xi))+γ_(xi) t and performing a standardlinear regression. Once we have values for α_(xi) and γ_(xi), anotherapproach is to compute β_(xi) from the time, t_(x), at which thethreshold g_(x) is exceeded. β_(xi)=(g_(x)−α_(xi))exp(−y_(xi) t_(x)).This will be a noisy measurement of the true amount of genetic data of aparticular allele.

Whatever techniques is used, we may model the linearized measurement asβ_(xi)=κ_(x)δ_(xi)+n_(xi) where κ_(x) is the number of copies of allelex, δ_(xi) is a constant for allele x and locus i, and n_(xi)˜N(0, σ_(x)²) where σ_(x) ² can be measured empirically.

Method 4: Using a Probability Distribution Function for theAmplification of Genetic Data at Each Locus

The quantity of material for a particular SNP will depend on the numberof initial segments in the cell on which that SNP is present. However,due to the random nature of the amplification and hybridization process,the quantity of genetic material from a particular SNP will not bedirectly proportional to the starting number of segments. Let q_(s,A),q_(s,G), q_(s,T), q_(s,c) represent the amplified quantity of geneticmaterial for a particular SNP s for each of the four nucleic acids(A,C,T,G) constituting the alleles. Note that these quantities may beexactly zero, depending on the technique used for amplification. Alsonote that these quantities are typically measured from the intensity ofsignals from particular hybridization probes. This intensity measurementcan be used instead of a measurement of quantity, or can be convertedinto a quantity estimate using standard techniques without changing thenature of the invention. Let q_(S) be the sum of all the geneticmaterial generated from all alleles of a particular SNP:q_(s)=q_(s,A)+q_(s,G)+q_(s,T)+q_(s,c). Let N be the number of segmentsin a cell containing the SNP s. N is typically 2, but may be 0,1 or 3 ormore. For any high or medium throughput genotyping method discussed, theresulting quantity of genetic material can be represented asq_(s)=(A+A_(θ,s))N+θ_(s) where A is the total amplification that iseither estimated a-priori or easily measured empirically, A_(θ,s) is theerror in the estimate of A for the SNP s, and θ_(s) is additive noiseintroduced in the amplification, hybridization and other process forthat SNP. The noise terms A_(θ,s) and θ_(s) are typically large enoughthat q_(s) will not be a reliable measurement of N. However, the effectsof these noise terms can be mitigated by measuring multiple SNPs on thechromosome. Let S be the number of SNPs that are measured on aparticular chromosome, such as chromosome 21. It is possible to generatethe average quantity of genetic material over all SNPs on a particularchromosome as follows:

$\begin{matrix}{q = {{\frac{1}{S}{\sum\limits_{s = 1}^{S}q_{s}}} = {{NA} + {\frac{1}{S}{\sum\limits_{s = 1}^{S}{A_{\theta,s}N}}} + \theta_{s}}}} & (16)\end{matrix}$

Assuming that A_(θ,s) and θ_(s) are normally distributed randomvariables with 0 means and variances σ² _(A) _(θ,S) and σ² _(θS), onecan model q=NA+φ where φ is a normally distributed random variable with0 mean and variance

$\frac{1}{S}{\left( {{N^{2}\sigma_{A_{\theta,s}}^{2}} + \sigma_{\theta}^{2}} \right).}$

Consequently, if sufficient number of SNPs are measured on thechromosome such that S>>(N²σ² _(a) _(θ,S+σ) ² _(θ)), then N=q/A can beaccurately estimated.

In another embodiment, assume that the amplification is according to amodel where the signal level from one SNP is s=a+α where (a+α) has adistribution that looks like the picture in FIG. 14A, left. The deltafunction at 0 models the rates of allele dropouts of roughly 30%, themean is a, and if there is no allele dropout, the amplification hasuniform distribution from 0 to a₀. In terms of the mean of thisdistribution a₀ is found to be a₀=2.86a. Now model the probabilitydensity function of a using the picture in FIG. 14B, right. Let S_(c) bethe signal arising from c loci; let n be the number of segments; letα_(i) be a random variable distributed according to FIGS. 14A and 14Bthat contributes to the signal from locus i; and let σ be the standarddeviation for all {α_(i)}. s_(c)=anc+E_(i=1 . . . nc) α_(i);mean(s_(c))=anc; std(s_(c))=sqrt(nc)σ. If σ is computed according to thedistribution in FIG. 14A, right, it is found to be σ=0.907a². We canfind the number of segments from n=s_(c)/(ac) and for “5-sigmastatistics” we require std(n)<0.1 sostd(s_(c))/(ac)=0.1=>0.95a·sqrt(nc)/(ac)=0.1 so c=0.95² n/0.1²=181.

Another model to estimate the confidence in the call, and how many locior SNPs must be measured to ensure a given degree of confidence,incorporates the random variable as a multiplier of amplificationinstead of as an additive noise source, namely s=a(1+α). Taking logs,log(s)=log(a)+ log(1+α). Now, create a new random variable γ=log(1+α)and this variable may be assumed to be normally distributed ˜N(0, σ). Inthis model, amplification can range from very small to very large,depending on σ, but never negative. Therefore α=e^(γ)−1; ands_(c)=Σ_(i=1 . . . cn) a(1+α_(i)). For notation, mean(s_(c)) andexpectation value E(s_(c)) are used interchangeably

E(s_(c))=acn+aE(Σ_(i=1 . . . cn)α_(i))=acn+aE(Σ_(i=1 . . . cn)α_(i))=acn(1+E(α))

To find E(α) the probability density function (pdf) must be found for awhich is possible since α is a function of γ which has a known Gaussianpdf. p_(α)(α)=p_(γ)(γ)(dγ/dα). So:

${p_{\gamma}(\gamma)} = {\frac{1}{\sqrt{2\; \pi}\sigma}e^{\frac{- \gamma^{2}}{2\; \sigma^{2}}}\mspace{20mu} {and}}$$\frac{d\; \gamma}{d\; \alpha} = {{\frac{d}{d\; \alpha}\left( {\log \left( {1 + \alpha} \right)} \right)} = {\frac{1}{1 + \alpha} = e^{- \gamma}}}$

and:

${p_{\alpha}(\alpha)} = {{\frac{1}{\sqrt{2\pi}\sigma}e^{\frac{- \gamma^{2}}{2\; \sigma^{2}}}e^{- \gamma}} = {\frac{1}{\sqrt{2\; \pi}\sigma}e^{\frac{- {({\log {({1 + \alpha})}})}^{2}}{2\; \sigma^{2}}}\frac{1}{1 + \alpha}}}$

This has the form shown in FIG. 15 for σ=1. Now, E(α) can be found byintegrating over this pdf E(α)=∫_(−∞) ^(∞)αp_(α)(α)dα which can be donenumerically for multiple different σ. This gives E(s_(c)) or mean(s_(c))as a function of σ. Now, this pdf can also be used to find var(s_(c)):

$\begin{matrix}{{{var}\left( s_{c} \right)} = {E\left( {s_{c} - {E\left( s_{c} \right)}} \right)}^{2}} \\{= {E\left( {{\sum\limits_{i = {1\ldots \; {cn}}}{a\left( {1 + \alpha_{i}} \right)}} - {acn} - {{aE}\left( {\sum\limits_{i = {1\; \ldots \; {cn}}}\alpha_{i}} \right)}} \right)}^{2}} \\{= {E\left( {{\sum\limits_{i = {1\ldots \; {cn}}}{a\; \alpha_{i}}} - {{aE}\left( {\sum\limits_{i = {1\; \ldots \; {cn}}}\alpha_{i}} \right)}} \right)}^{2}} \\{= {a^{2}{E\left( {{\sum\limits_{i = {1\ldots \; {cn}}}\alpha_{i}} - {{cnE}(\alpha)}} \right)}^{2}}} \\{= {a^{2}{E\left( {\left( {\sum\limits_{i = {1\ldots \; {cn}}}\alpha_{i}} \right)^{2} - {2{{cnE}(\alpha)}\left( {\sum\limits_{i = {1\ldots \; {cn}}}^{\;}\alpha_{i}} \right)} + {c^{2}n^{2}{E(\alpha)}^{2}}} \right)}}} \\{= {a^{2}{E\left( {{{cn}\; \alpha^{2}} + {{{cn}\left( {{cn} - 1} \right)}\alpha_{i}\alpha_{j}} - {2{{cnE}(\alpha)}\left( {\sum\limits_{i = {1\ldots \; {cn}}}\alpha_{i}} \right)} + {c^{2}n^{2}{E(\alpha)}^{2}}} \right)}}} \\{= {a^{2}c^{2}{n^{2}\left( {{E\left( \alpha^{2} \right)} + {\left( {{cn} - 1} \right){E\left( {\alpha_{i}\alpha_{j}} \right)}} - {2{{cnE}(\alpha)}^{2}} + {{cnE}(\alpha)}^{2}} \right)}}} \\{= {a^{2}c^{2}{n^{2}\left( {{E\left( \alpha^{2} \right)} + {\left( {{cn} - 1} \right){E\left( {\alpha_{i}\alpha_{j}} \right)}} - {{cnE}(\alpha)}^{2}} \right)}}}\end{matrix}$

which can also be solved numerically using p_(α)(α) for multipledifferent σ to get var(s_(c)) as a function of σ. Then, we may take aseries of measurements from a sample with a known number of loci c and aknown number of segments n and find std(s_(c))/E(s_(c)) from this data.That will enable us to compute a value for σ. In order to estimate n,E(s_(c))=nac(1+E(α)) so

$\hat{n} = \frac{s_{c}}{a\; {c\left( {1 + {E(\alpha)}} \right)}}$

can be measured so that

${{std}\left( \hat{n} \right)} = {\frac{{std}\left( s_{c} \right)}{a\; {c\left( {1 + {E(\alpha)}} \right)}}{{std}(n)}}$

When summing a sufficiently large number of independent random variablesof 0-mean, the distribution approaches a Gaussian form, and thus s_(c)(and {circumflex over (n)}) can be treated as normally distributed andas before we may use 5-sigma statistics:

${{std}\left( \hat{n} \right)} = {\frac{{std}\left( s_{c} \right)}{a\; {c\left( {1 + {E(\alpha)}} \right)}} < 0.1}$

in order to have an error probability of 2 normcdf(5,0,1)=2.7e−7. Fromthis, one can solve for the number of loci c.

Sexing

In one embodiment of the system, the genetic data can be used todetermine the sex of the target individual. After the method disclosedherein is used to determine which segments of which chromosomes from theparents have contributed to the genetic material of the target, the sexof the target can be determined by checking to see which of the sexchromosomes have been inherited from the father: X indicates a female,and Y indicates a male. It should be obvious to one skilled in the arthow to use this method to determine the sex of the target.

Validation of the Hypotheses

In some embodiments of the system, one drawback is that in order to makea prediction of the correct genetic state with the highest possibleconfidence, it is necessary to make hypotheses about every possiblestate. However, as the possible number of genetic states areexceptionally large, and computational time is limited, it may not bereasonable to test every hypothesis. In these cases, an alternativeapproach is to use the concept of hypothesis validation. This involvesestimating limits on certain values, sets of values, properties orpatterns that one might expect to observe in the measured data if acertain hypothesis, or class of hypotheses are true. Then, the measuredvalues can be tested to see if they fall within those expected limits,and/or certain expected properties or patterns can be tested for, and ifthe expectations are not met, then the algorithm can flag thosemeasurements for further investigation.

For example, in a case where the end of one arm of a chromosome isbroken off in the target DNA, the most likely hypothesis may becalculated to be “normal” (as opposed, for example to “aneuploid”). Thisis because the particular hypotheses that corresponds to the true stateof the genetic material, namely that one end of the chromosome hasbroken off, has not been tested, since the likelihood of that state isvery low. If the concept of validation is used, then the algorithm willnote that a high number of values, those that correspond to the allelesthat lie on the broken off section of the chromosome, lay outside theexpected limits of the measurements. A flag will be raised, invitingfurther investigation for this case, increasing the likelihood that thetrue state of the genetic material is uncovered.

It should be obvious to one skilled in the art how to modify thedisclosed method to include the validation technique. Note that oneanomaly that is expected to be very difficult to detect using thedisclosed method is balanced translocations.

Application of the Method with Contaminated DNA

In one embodiment of the system, genetic data from target DNA which hasbeen definitely or possibly contaminated with foreign DNA can also becleaned using the disclosed method. The concept outlined above, that ofhypothesis validation, can be used to identify genetic samples that falloutside of expected limits; in the case of contaminated samples it isexpected that this validation will cause a flag to be raised, and thesample can be identified as contaminated.

Since large segments of the target DNA will be known from the parentalgenetic data, and provided the degree of contamination is sufficientlylow and sufficient SNPs are measured, the spurious data due to theforeign genetic material can be identified. The method disclosed hereinshould still allow for the reconstruction of the target genome, albeitwith lower confidence levels. Provided that the level of contaminationis sufficiently low, the hypothesis that is calculated to be most likelyis still expected to correspond to the true state of the geneticmaterial in the target DNA sample.

It should be obvious to one skilled in the art how to optimize thesemethods for the purpose cleaning genetic data contaminated with spurioussignals due to foreign DNA.

Example of Reduction to Practice

In one embodiment of the system, the method described above can beimplemented using a set of algorithms which will calculate the mostlikely identity of each SNP in a list of relevant SNPs, as well as aconfidence level for each SNP call. Described here is one possible wayto implement the method disclosed in this patent. FIG. 16 and FIG. 17visually represent the breakdown of this implementation of the disclosedmethod, the input requirements and the format of the output. Note thatthe implementations discussed here were done using the computer programMatlab, and a familiarity with this product will facilitate theunderstanding of the examples.

FIG. 16 focuses on the input data (1601) and its format andrequirements, as well as the output data (1605) and its format. Input tothe algorithm consists of the measured data (1602), including input bythe user, and existing data (1603) preserved in the database, that isconsequently updated by the newly collected data. The measured data (MD,1602) consists of the genetic data as measured for desired SNPs for theembryo, and the paternal and maternal alleles, as well as the accuracy,or confidence with which each of the alleles is known. The existing data(1603) consists of the population frequency data (FD), measurement biasdata (BD), and crossover data (CD).

The population frequency data (FD) contains the allele frequency (foreach of the values A,C,T,G) for each of the SNPs available. The data canbe previously known or measured, and can be updated with newly collecteddata as described elsewhere in this document.

Measurement bias data (BD) captures the bias of the measurement processtowards certain values. For example, assuming the true value of theallele is X=A, and probability of the correct measurement is p_(x), thedistribution of the measured value x is:

A C T G Probability p_(X) p_(C) P_(T) P_(G) probability with no biasp_(X) (1 − p_(X))/3 (1 − p_(X))/3 (1 − p_(X))/3where p_(X)+p_(C)+p_(T)+p_(G)=1. If there is no bias of measurementtowards any of the values then p_(C)=p_(T)=p_(G)=(1−p_(X))/3. Thisinformation can be discerned from empirical and theoretical knowledgeabout the mechanism of the measurement process and the relevantinstruments.

Crossover data (CD) consists of a database of genetic distances andcrossover probabilities between pairs of snips, collected from HAPMAPdata.

Together, (MD), (FD), (BD), (CD) make up the necessary input to thedisclosed method (termed ‘Parental Support’, 1604) algorithm. Thisalgorithm (1604) then operates on the input data to generate the outputdata (1605), which describes the most likely “true” value of thetarget's genetic data given the measured values, as well as the mostlikely origin of each SNP in terms of the parental alleles.

FIG. 17 focuses on the structure of the algorithm itself (termed‘Parental Support’) and how each of these input data are utilized by thealgorithm. Working backwards: to find the most likely hypotheses it isnecessary to calculate P(H|M) 1707, the probability of the hypothesisgiven the measurement, for all the possible hypotheses H.

As described previously:

${{P\left( H \middle| M \right)} = {\frac{P\left( M \middle| H \right)}{P(M)}{P(H)}}},{{P(M)} = {\sum\limits_{h \in S_{H}}{{P\left( M \middle| h \right)}{P(h)}}}}$

In order to find P(H|M) (1710), it is first necessary to find P(M|H)(1707), and P(H) (1708), for all hypotheses H. This allows thecalculation of P(M), 1709 by the equation shown above. The probabilityof the hypothesis P(H) 1708 depends on how many crossovers are assumedand the likelihood of each of these crossovers (CD, 1704), as explainedabove.

P(M|H) can be calculated using the following equation:

${{P\left( M \middle| H \right)} = {\sum\limits_{t}{{P\left( {{\left. M \middle| H \right.\&}\mspace{11mu} t} \right)}{P(t)}}}},$

as explained previously.

P(t), 1706 is the frequency of a particular value t for paternal andmaternal alleles and is derived from population frequency data (FD,1703). P(M|H&t), 1705 is the probability of correctly measuring theallele values of the embryo, the father, and the mother, assuming aparticular “true” value t. The measurement data and accuracy entered bythe user (MD, 1701), and the measurement bias database (BD, 1702) arethe inputs required to calculate P(M|H&t), 1705.

A more detailed description of the method is given forthwith. Begin withSNPs R={r₁, . . . ,r_(k)}, (a set of k SNPs), and the correspondingmeasured identities of parents and embryo, M=(e₁,e₂,p₁,p₂,m₁,m₂), for kSNPs, identified with id's s₁, . . . ,s_(k), where:

e₁=(e₁₁,e₁₂, . . . ,e_(1k)) is the measurement on one of the chromosomesof the embryo (they don't all have to come from the same parentalchromosome) for all the SNPs

e₂=(e₂₁,e₂₂, . . . ,e_(2k)) is the measurement on the other chromosomeof the embryo

p₁=(p₁₁,p₁₂, . . . ,p_(1k)) is the measurement on the FIRST chromosomeof the father (all coming from the same chromosome

p₂=(p₂₁,p₂₂, . . . ,p_(2k)) is the measurement on the SECOND chromosomeof the father (all coming from the same chromosome)

m₁=(m₁₁,m₁₂, . . . ,m_(1k)) is the measurement on the FIRST chromosomeof the mother (all coming from the same chromosome)

m₂=(m₂₁,m₂₂, . . . ,m_(2k)) is the measurement on the SECOND chromosomeof the mother (all coming from the same chromosome)

One can also write M={M₁, . . . ,M_(k)} whereM_(i)=(e_(1i),e_(2i),p_(1i),p_(2i))

The goal of the method is to determine the “true” embryo valueT=(E1,E2),i.e. the most likely case given the measurement M, where:

E₁=(E₁₁,E₁₂, . . . ,E_(1k)) is the measurement on the FIRST chromosomeof the embryo, corresponding to the PATERNAL chromosome, E_(1i)∈{p_(1i), p_(2i)}

E₂=(E₂₁,E₂₂, . . . ,E_(2k)) is the measurement on the SECOND chromosomeof the embryo, corresponding to the MATERNAL value, E_(2i)∈{m_(1i),m_(2i)}

One can also write T={T₁, . . . ,T_(k)} where T_(i)=(E_(1i),E_(2i)).

Effectively, the parental chromosome values (p₁,p₂,m₁,m₂) are being usedas support to check, validate and correct measured values of (e₁,e₂),hence the term “Parental Support Algorithm”.

To achieve this goal, all the possible hypotheses for the origin ofembryo values are developed and the most likely one is chosen, given themeasurement M. The hypotheses space is S_(H)={H¹, . . . ,H^(q)}={set ofall the hypotheses}, where each hypothesis is of the format H^(j)=(H^(j)₁, . . . H^(j) _(k)) where H^(j) _(i) is the “mini” hypothesis for SNPi, of the format H^(j) _(i)=(p_(i)*,m_(i)*) where p_(i)*∈{p_(1i),p_(2i)}and m_(i)*∈{m_(1i),m_(2i)}. There are 4 different “mini” hypothesesH^(j) _(i), in particular:

H ^(j) _(i)1:(e _(1i) ,e _(2i))={(p _(1i) ,m _(1i)) or (m _(1i) ,p_(1i))};H ^(j) _(i)2:(e _(1i) ,e _(2i))={(p _(1i) ,m _(2i)) or (m _(2i),p _(1i))}

H ^(j) _(i)3:(e _(1i) ,e _(2i))={(p _(2i) ,m _(1i)) or (m _(1i) ,p_(2i))};H ^(j) _(i)4:(e _(1i) ,e _(2i))={(p _(2i) ,m _(2i)) or (m _(2i),p _(2i))}

In theory, S^(H) can have q=4^(k) different members to pick from, thoughlater this space will be limited with a maximal number of crossovers ofpaternal and maternal chromosomes.

The most likely hypothesis H* is chosen to be as: H*=arg max_(H∈S) _(H)P(H|M) For a particular H:

${P\left( H \middle| M \right)} = {\frac{P\left( M \middle| H \right)}{\sum\limits_{h \in S_{H}}{{P\left( M \middle| h \right)}{P(h)}}}{{P(H)}.}}$

So deriving for each hypothesis:(1) P(M/H) is the probability of measurement M given the particularhypothesis H.(2) P(H) is the probability of the particular hypothesis H.(3) P(M) is the probability of the measurement M.After deriving P(H|M) for all H, the one with the greatest probabilityis chosen.

Deriving P(M|H)

Since measurements on each SNP are independent, for M=(M₁, . . . M_(k))and the particular hypothesis H=(H₁, . . . H_(k)) on all k SNPs then:

P(M|H)=P(M ₁ |H ₁)* . . . *P(M _(k) |H _(k))

For the particular SNP r, derive P(M_(r)|H_(r)). ForΩ={A,C,T,G}X{A,C,T,G}X={A,C,T,G}X{A,C,T,G}, the space of all thepossible values for “true” parent values (P_(1r),P_(2r),M_(1r),M_(2r)),by Bayes formula is:

${P\left( M_{r} \middle| H_{r} \right)} = {\sum\limits_{t \in \Omega}{{P\left( {{{{M_{r}/H_{r}}\;\&}\mspace{11mu} \left( {P_{1\; r},P_{2\; r},M_{1\; r},M_{2\; r}} \right)} = t} \right)}*{P\left( {\left( {P_{1\; r},P_{2\; r},M_{1\; r},M_{2\; r}} \right) = t} \right)}}}$

Deriving P(M_(r)/H_(r) & (P_(1r),P_(2r),M_(1r),M_(2r))=t)

M_(r)=(e_(1r),e_(2r),p_(1r),p_(2r),m_(1e),m_(2r)) is a given measurementon this SNP.

T=(E_(1r),E_(2r),P_(1r),P_(2r),M_(1r),M_(2r)) is the supposed “true”value, for t=(P_(1r),P_(2r),M_(1r),M_(2r)) and (E_(1r),E_(2r)) fixedfrom T by hypothesis. (E_(1r) is one of P_(1r),P_(2r), E_(2r) is one ofM_(1r),M_(2r))

P(M _(r)=(e _(1r) ,e _(2r) ,p _(1r) ,p _(2r) ,m _(1r) ,m _(2r))/T=(E_(1r) ,E _(2r) ,P _(r1) ,P _(2r) ,M _(1r) ,M _(2r)))=P(e _(1r) /E_(1r))*P(e _(2r) /E _(2r))*P(p _(1r) /P _(r1))*P(P _(2r) /P _(2r))*P(m_(1r) /M _(1r))*P(m _(2r) /M _(2r))

Given:

p_(eri)=P(accurately measuring the embryo value i,on SNP r)

p_(pri)=P(accurately measuring the father value i,on SNP r)

p_(mri)=P(accurately measuring the mother value i,on SNP r)

$\begin{matrix}{{P\left( {e_{1r}/E_{1r}} \right)} = \left\{ \begin{matrix}p_{{er}\; 1} & {e_{1\; r} = E_{1\; r}} \\\left( {1 - {P_{{{er}\; 1})}*{p\left( {e_{1\; r},E_{1\; r},r} \right)}}} \right. & {e_{1\; r} \neq E_{1\; r}}\end{matrix} \right.} \\{= {{I_{e_{1r} = E_{1r}}*p_{{er}\; 1}} + {I_{e_{1\; r} \neq E_{1\; r}}*\left( {1 - p_{{pr}\; 1}} \right)*}}} \\{{p\left( {e_{1\; r},E_{1\; r},r} \right)}} \\{= {F\left( {e_{1\; r},E_{1\; r},p_{e\; r\; 1},r} \right)}}\end{matrix}$

where p(e_(1r),E_(1r),r)=⅓ if there is no measurement bias, otherwise itcan be determined from experimental data, such as data from the Hapmapproject.Deriving P((P_(Ir),P_(2r), M_(1r),M_(2r))=t)

For t=(t1,t2,t3,t₄):

P((P _(1r) ,P _(2r) ,M _(1r) ,M _(2r))=(t ₁ ,t ₂ ,t ₃ ,t ₄))=P(P _(1r)=t ₁)*P(P _(2r) =t ₂)*P(M _(1r) =t ₃)*P(M=t ₄)

Suppose there are n samples of (P₁,P₂,M₁,M₂), all paternal and maternalvalues are assumed to be independent, and t=(t₁,t₂,t₃,t₄) for t_(i) in{A,C,T,G}

To get a particular p_(1A)=P(P₁=t₁), for t₁=A, assume that in absence ofany data this probability could be anything between 0 and 1, so it isassigned a value of U(0,1). With the acquisition of data, this isupdated with the new values and the distribution of this parameterbecomes a beta distribution. Suppose that out of n observations of P1,there are h values P1=A, and w=(event P₁=A) and D=(data given). It isdescribed in a prior section the form of the beta distribution B(α,β)with α=h+1, β=n−h+1 for p(w|Data) (see equation (8)). The expected valueand variance of X˜B(α,β) distribution are:

${EX} = \frac{\alpha}{\alpha + \beta}$${VX} = \frac{\alpha \; \beta}{\left( {\alpha + \beta} \right)^{2}\left( {\alpha + \beta + 1} \right)}$

So the posterior mean value of the parameterp_(1rA)=P(P_(1r)=A|Data)=(h+1)/(n+2) Similarlyp_(1rB)=(#(p_(1r)=B)+1)/(n+2), . . . m_(2rG)=(#(m_(2r)=G)+1)/(n+2), etc.Thus all the values p_(1rA), . . . ,m_(2rG) have been derived and:

P((P _(1r) ,P _(2r) ,M _(1r) ,M _(2r))=(t ₁ ,t ₂ ,t ₃ ,t ₄))=p _(1rt) ₁*p _(2rt) ₂ ,*m ^(1rt) ₃ *m _(2r1t) ₄

Deriving P(H)

The probability of the hypothesis H=(H₁, . . . ,H_(k)) withH_(i)=(p_(i)*,m_(i)*) depends on the amount of chromosome crossover. Forexample,

with P(crossover)=0 then P(H)=¼ and H=(p*,m*) if p*in{(p11,p21, . . .ps1), (p12,p22, . . . ,ps2), m* in {(m11,m21, . . . ,ms1),(m12,m22, . .. ,ms2)}, 0 otherwise

with P(crossover)>0 it is important to incorporate the probability ofcrossover between each SNP.

Hypothesis H consists of the hypothesis for paternal and maternalchromosomes for each SNP, p_(i)*∈{P_(1i),p_(2i)} and m_(1i)*∈{m_(1i),m_(2i)},i.e. H=(H_(p),H_(m)) where H_(p)=(p₁*, . . . p_(k)*), andH_(m)=(m₁*, . . . m_(k)*), which are independent.

P(H)=P(H_(p))*P(H_(m)). Suppose that SNP are ordered by increasinglocation,

${P\left( H_{p} \right)} = {\frac{1}{4}{\prod\limits_{i = 2}^{k}\; \left( {{{PC}_{i}*\left( {1 - I_{i}} \right)} + {\left( {1 - {PC}_{i}} \right)*I_{i}}} \right.}}$

where PC_(i)=P(crossover(r_(i−1), r_(i))) i.e. probability of crossoversomewhere between SNPs r_(i−1),r_(i), and I_(i)=1 if p_(i)*,p_(i−1)* arecoming both from p₁ or p₂, and it is 0 otherwise.

Deriving P(crossover(a,b))

Given SNPs a,b, at base locations l_(a),l_(b) (given in bases), theprobability of crossover is approximated as P(l_(a),l_(b))=0.5(1−exp(−2G(l_(a),l_(b)))) where G(l_(a),l_(b))=genetic distance in Morgansbetween locations l_(a),l_(b). There is no precise closed form functionfor G but it is loosely estimated as G(l_(a),l_(b))=|l_(a)−l_(b)|*1e⁻⁸.A better approximation can be used by taking advantage of the HapMapdatabase of base locations s_(i), and distances G(s_(i),s_(i+1)), for ispanning over all locations. In particular,

${{G\left( {l_{a},l_{b}} \right)} = {\sum\limits_{l_{a} < s_{i} < l_{b}}{G\left( {s_{i},s_{i + 1}} \right)}}},$

so it can be used in crossover probability.

Deriving P(M)

Once P(M|H) is known, P(H) can be found for all the different H inS_(H),

${P(M)} = {\sum\limits_{H \in S_{H}}{{P\left( M \middle| H \right)}{P(H)}}}$

A More Expedient Method to Derive the Hypothesis of Maximal Probability

Given the limitation of computer time, and the exponential scaling ofcomplexity of the above method as the number of SNPs increases, in somecases it may be necessary to use more expedient methods to determine thehypothesis of maximal probability, and thus make the relevant SNP calls.A more rapid way to accomplish this follows:

From before: P(H|M)=P(M|H)*P(H)/P(M), argmax_(H) P(H|M)=argmax_(H) andP(M|H)*P(H)=argmax_(H) F(M,H), and the object is to find H, maximizingF(M,H).

Suppose M_((s,k))=measurement on snips s to k, H_((s,k))=hypothesis onsnips s to k, and for shorts M_((k,k))=M_(k),H_((k,k))=H_(k)=measurement and hypothesis on snip k. As shown before:

${P\left( M_{({1,k})} \middle| H_{({1,k})} \right)} = {{\prod\limits_{i = 1}^{k}\; {P\left( M_{i} \middle| H_{i} \right)}} = {{{P\left( M_{k} \middle| H_{k} \right)}*{\prod\limits_{i = 1}^{k - 1}\; {P\left( M_{i} \middle| H_{i} \right)}}} = {{P\left( M_{k} \middle| H_{k} \right)}*{P\left( M_{({1,{k - 1}})} \middle| H_{({1,{k - 1}})} \right)}\mspace{14mu} {and}\mspace{20mu} {also}}}}$${P\left( H_{({1,k})} \right)} = {{{1/4}*{\prod\limits_{i = 2}^{k}\; {{PF}\left( {H_{i - 1},H_{i}} \right)}}} = {{{{PF}\left( {H_{k - 1},H_{k}} \right)}*{1/4}*{\prod\limits_{i = 2}^{k - 1}\; {{PF}\left( {H_{i - 1},H_{i}} \right)}}} = {{{PF}\left( {H_{k - 1},H_{k}} \right)}*{P\left( H_{({1,{k - 1}})} \right)}\mspace{14mu} {where}}}}$$\mspace{20mu} {{{PF}\left( {H_{i - 1},H_{i}} \right)} = \left\{ \begin{matrix}{1 - {{PC}\left( {H_{i - 1},H_{i}} \right)}} & {H_{i - 1} = H_{i}} \\{{PC}\left( {H_{i - 1},H_{i}} \right)} & {H_{i - 1} \neq H_{i}}\end{matrix} \right.}$

and PC(H_(i−1),H_(i))=probability of crossover between H_(i−1), H_(i)

So finally, for n snips:

$\begin{matrix}{{F\left( {M,H} \right)} = {{P\left( M \middle| H \right)}*{P(H)}}} \\{= {{P\left( M_{({1,n})} \middle| H_{({1,n})} \right)}*{P\left( H_{({1,n})} \right)}}} \\{= {{P\left( M_{({1,{n - 1}})} \middle| H_{({1,{n - 1}})} \right)}*{P\left( H_{({1,{n - 1}})} \right)}*}} \\{{{P\left( M_{n} \middle| H_{n} \right)}*{{PF}\left( {H_{n - 1},H_{n}} \right)}}}\end{matrix}$

therefore: F(M, H)=F(M_((1,n)), H_((1,n))))=F(M_((1,n-1)),H_((1,n-1)))*P(M_(n)|H_(n))*PF(H_(n-1), H_(n)) Thus, it is possible toreduce the calculation on n snips to the calculation on n−1 snips.

For H=(H₁, . . . H_(n)) hypothesis on n snips:

${\max\limits_{H}\; {F\left( {M,H} \right)}} = {\max\limits_{({H_{({1,{{ni}\; 1}})},H_{n}})}\; {F\left( {M,{\left( {H_{({1,{n - 1}})},H_{n}} \right) = {{\max\limits_{H_{n}}\mspace{11mu} {\max\limits_{H_{({1,{n - 1}})}}{F\left( {M,H_{({1,{n - 1}})},H_{n}} \right)}}} = {{\max\limits_{H_{n}}{{G\left( {M_{({1,n})},H_{n}} \right)}\mspace{20mu} {where}{G\left( {M_{({1,n})},H_{n}} \right)}}} = {\max\limits_{H_{({1,{n - 1}})}}\; {F\left( {M_{({1,n})},{\left( {H_{({1,{n - 1}})},H_{n}} \right) = {{\max\limits_{H_{({1,{n - 1}})}}{{F\left( {M_{({1,{n - 1}})},H_{({1,{n - 1}})}} \right)}*{P\left( M_{n} \middle| H_{n} \right)}*{{PF}\left( {H_{n - 1},H_{n}} \right)}}} = {{{P\left( M_{n} \middle| H_{n} \right)}*{\max\limits_{H_{({1,{n - 1}})}}{{F\left( {M_{({1,{n - 1}})},H_{({1,{n - 1}})}} \right)}*{{PF}\left( {H_{n - 1},H_{n}} \right)}}}} = {P\left( M_{n} \middle| H_{n} \right)*{\max\limits_{H_{n - 1}}{\max\limits_{H_{({1,{n - 2}})}}{F\left( {M_{({1,{n - 1}})},{{\left( {H_{({1,{n - 2}})},H_{n - 1}} \right)*{{PF}\left( {H_{n - 1},H_{n}} \right)}} = {{P\left( M_{n} \middle| H_{n} \right)}*{\max\limits_{H_{n - 1}}\; {{{PF}\left( {H_{n - 1},H_{n}} \right)}*{G\left( {M_{({1,{n - 1}})},H_{n -}} \right)}}}}}} \right.}}}}}}}} \right.}}}}}} \right.}}$

In summary:

${\max\limits_{H}\; {F\left( {M,H} \right)}} = {\max\limits_{H_{n}}\; {G\left( {M_{({1,n})},H_{n}} \right)}}$

where G can be found recursively: for i=2, . . . n

${G\left( {M_{({1,n})},H_{n}} \right)} = {{P\left( M_{n} \middle| H_{n} \right)}*{\max\limits_{H_{n - 1}}\left\lfloor {{{PF}\left( {H_{n - 1},H_{n}} \right)}*{G\left( {M_{({1,{n - 1}})},H_{n - 1}} \right)}} \right\rfloor}}$  and  G(M_((1, 1)), H₁) = 0.25 * P(M₁|H₁).

The best hypothesis can be found by following the following algorithm:

Step 1: For I=1, generate 4 hypotheses for H_(i), make G(M₁|H₁) for eachof these, and remember G₁,G₂, G₃,G₄Step 2: For I=2: generate 4 hypotheses for H₂, make G(M_((1,2))|H₂)using the above formula:

${{G\left( {M_{({1,2})},H_{2}} \right)} = {{P\left( M_{2} \middle| H_{2} \right)}*{\max\limits_{H_{1}}\left\lbrack {{{PF}\left( {H_{1},H_{2}} \right)}*{G\left( {M_{1},H_{1}} \right)}} \right\rbrack}}},$

remember these new four G_(n).Repeat step 2 for I=k with k_(i)=k_(i−1)+1 until k=n: generate 4hypothesis for H_(k), make G(M_((1,k))|H_(k))

${G\left( {M_{({1,k})},H_{k}} \right)} = {{P\left( M_{k} \middle| H_{k} \right)}*{\max\limits_{H_{k - 1}}\left\lfloor {{{PF}\left( {H_{k - 1},H_{k}} \right)}*{G\left( {M_{({1,{k - 1}})},H_{k - 1}} \right)}} \right\rfloor}}$

and remember these four G_(n).

Since there are only four hypotheses to remember at any time, and aconstant number of operations, the algorithm is linear.

To find P(M): P(H|M)=P(M|H)*P(H)/P(M)=F(M,H)/P(M))

As above:

$\mspace{20mu} \begin{matrix}{{P(M)} = {P\left( M_{({1,k})} \right)}} \\{= {\sum\limits_{H_{({1,k})}}{{P\left( M_{({1,k})} \middle| H_{({1,k})} \right)}*{P\left( H_{({1,k})} \right)}}}} \\{= {\sum\limits_{H_{k}}{{P\left( M_{K} \middle| H_{k} \right)}{\sum\limits_{H_{({1,{k - 1}})}}{{P\left( M_{({1,{k - 1}})} \middle| H_{({1,{k - 1}})} \right)}*}}}}} \\{{{P\left( H_{({1,{k - 1}})} \right)}*{{PF}\left( {H_{k - 1},H_{k}} \right)}}} \\{= {\sum\limits_{H_{k}}{{P\left( M_{K} \middle| H_{k} \right)}*{W\left( M_{({1,{k - 1}})} \middle| H_{k} \right)}}}}\end{matrix}$   where${W\left( M_{({1,{k - 1}})} \middle| H_{k} \right)} = {\sum\limits_{H_{({1,{k - 1}})}}{{P\left( M_{({1,{k - 1}})} \middle| H_{({1,{k - 1}})} \right)}*{P\left( H_{({1,{k - 1}})} \right)}*{{PF}\left( {H_{k - 1},H_{k}} \right)}}}$

W(M,H) can be solved by using recursion:

$\begin{matrix}{{W\left( M_{({1,{k - 1}})} \middle| H_{k} \right)} = {\sum\limits_{H_{({1,{k - 1}})}}{{P\left( M_{({1,{k - 1}})} \middle| H_{({1,{k - 1}})} \right)}*{P\left( H_{({1,{k - 1}})} \right)}*{{PF}\left( {H_{k - 1},H_{k}} \right)}}}} \\{= {\sum\limits_{H_{k - 1}}{{P\left( M_{k - 1} \middle| H_{k - 1} \right)}{\sum\limits_{H_{({1,{k - 2}})}}{{P\left( M_{({1,{k - 2}})} \middle| H_{({1,{k - 2}})} \right)}*}}}}} \\{{{P\left( H_{({1,{k - 2}})} \right)}*{{PF}\left( {H_{k - 2},H_{k - 1}} \right)}*{{PF}\left( {H_{k - 1},H_{k}} \right)}}} \\{= {\sum\limits_{H_{k - 1}}{{P\left( M_{k - 1} \middle| H_{k - 1} \right)}*{{PF}\left( {H_{k - 1},H_{k}} \right)}*}}} \\{{W\left( M_{({1,{k - 2}})} \middle| H_{k - 1} \right)}}\end{matrix}$

Therefore:

${W\left( M_{({1,{k - 1}})} \middle| H_{k} \right)} = {\sum\limits_{H_{k - 1}}{{P\left( M_{k - 1} \middle| H_{k - 1} \right)}*{{PF}\left( {H_{k - 1},H_{k}} \right)}*{W\left( M_{({1,{k - 2}})} \middle| H_{k - 1} \right)}}}$$\mspace{20mu} {{{and}\mspace{14mu} {W\left( M_{({1,1})} \middle| H_{2} \right)}} = {\sum\limits_{H_{1}}{{P\left( M_{1} \middle| H_{1} \right)}*0.25*{{PF}\left( {H_{1},H_{2}} \right)}}}}$

The algorithm is similar to the case above, where i=2:n and in each stepa new set of W(i) are generated until the final step yields theoptimized W.

Deriving the p₁, p₂, pp₁, pp₂ values from d₁, d₂, h, pd₁, pd₂, ph

For the purpose of explanation, this section will focus on the paternaldiploid and haploid data, but it is important to note that maternal datacan be treated similarly. Let:

d₁, d₂—allele calls on the diploid measurements

h—allele call on the haploid measurement

p_(d1), p_(d2)-probabilities of a correct allele call on the each of thediploid measurements

p_(h)—probability of a correct allele call on the haploid measurement

The data should be mapped to the following input parameters fordisclosed algorithm:

p₁—allele corresponding to haploid cell and one of the diploid cells

p₂—allele corresponding to the remaining diploid cell

p_(p1), p_(p2)—probabilities of correct allele call

Since h corresponds to d₁, then to find the value of p₁ it is necessaryto use h and d₁. Then p₂ will automatically correspond to d₂. Similarly,if h corresponds to d₂, then to find the value of p₁ it is necessary touse h and d₂, and then p₂ will correspond to d₁.

The term “correspond” is used since it can mean either “be equal” or“originate with higher probability from” depending on differentmeasurement outcomes and population frequency.

The goal of the algorithm is to calculate probabilities of “true” allelevalues hidden beyond results of raw measurement h, d₁, d₂, p_(h),p_(d1), p_(d2) and population frequencies.

The basic algorithm steps are the following:

-   -   (i) determine whether h corresponds to d₁ or d₂ based on h, d₁,        d₂, p_(h), p_(d1), p_(d2) values and the population frequency        data    -   (ii) assign the allele calls to p₁ and p₂; calculate the        probabilities p_(p1) and p_(p2) based on step (1)        Assigning h to d₁ or d₂

Establish two hypotheses:

H₁: h corresponds to d₁ (h originates from d₁)

H₂: h corresponds to d₂ (h originates from d₂)

The task is to calculate probabilities of these two hypotheses given themeasurement M:

P(H ₁ /M(h,d ₁ ,d ₂ ,p _(h) ,p _(d1) ,p _(d2))) and P(H ₂ /M(h,d ₁ ,d ₂,p _(h) ,p _(d1) ,p _(d2)))

(To simplify the text, these will be referred to as P(H₁/M) and P(H₂/M))hereafter.

In order to calculate these probabilities, apply the Bayesian rule:

${{P\left( H_{1} \middle| M \right)} = \frac{{P\left( M \middle| H_{1} \right)}*{P\left( H_{1} \right)}}{P(M)}};$${{P\left( H_{2} \middle| M \right)} = \frac{{P\left( M \middle| H_{2} \right)}*{P\left( H_{2} \right)}}{P(M)}},$

where P(M)=P(M/H₁)*P(H₁)+P(M/H₂)*P(H₂). Since hypotheses H1 and H2 areequally likely, P(H₁)=P(H₂)=0.5, therefore:

${{P\left( H_{1} \middle| M \right)} = {\frac{P\left( M \middle| H_{1} \right)}{{P\left( M \middle| H_{1} \right)} + {P\left( M \middle| H_{2} \right)}}\mspace{14mu} {and}}}\;$${P\left( H_{2} \middle| M \right)} = \frac{P\left( M \middle| H_{2} \right)}{{P\left( M \middle| H_{1} \right)} + {P\left( M \middle| H_{2} \right)}}$

In order to calculate P(M/H₁) and P(M/H₂), one must consider the set ofall possible values of diploid outcomes d₁ and d₂, Ω={AA,AC, . . . ,GG},i.e. any combination of A,C,T,G, so called underlying states. When thehypotheses are applied to the underlying states (i.e. accompany theassumed value of h based on hypothesis H₁ or H₂, to values d₁ and d₂),the following tables of all possible combinations (states S={s₁,s₂, . .. ,s₁₆}) of “true values” H, D₁ and D₂ for h, d₁ and d₂, can begenerated, respectively. These are listed here:

Hypothesis H₁: h = d₁ Ω = {AA, AC, . . . , GG} state H D₁ D₂ s₁ A A A s₂A A C s₃ A A T s₄ A A G s₅ C C A s₆ C C C s₇ C C T s₈ C C G s₉ T T A s₁₀ T T C  s₁₁ T T T  s₁₂ T T G  s₁₃ G G A  s₁₄ G G C  s₁₅ G G T  s₁₆ GG G

Hypothesis H₂: h = d₂ Ω = {AA, AC, . . . GG} state H D₁ D₂ s₁ A A A s₂ CA C s₃ T A T s₄ G A G s₅ A C A s₆ C C C s₇ T C T s₈ G C G s₉ A T A  s₁₀C T C  s₁₁ T T T  s₁₂ G T G  s₁₃ A G A  s₁₄ C G C  s₁₅ T G T  s₁₆ G G GSince the “true values” H, D₁ and D₂ are unknown, and only the rawmeasurement outcomes h, d₁, d₂, p_(h), p_(d1), p_(d2), are known, thecalculation of the P(M/H₁) and P(M/H₂) over the entire set Ω must beperformed in the following manner:

${P\left( M \middle| H_{1} \right)} = {\sum\limits_{{({D_{1},D_{2}})} \in \Omega}{{P\left( {{{\left. {M\left( {h,d_{1},d_{2}} \right)} \middle| H_{1} \right.\&}\mspace{11mu} D_{1}},D_{2}} \right)}*{P\left( {D_{1},D_{2}} \right)}}}$${P\left( M \middle| H_{2} \right)} = {\sum\limits_{{({D_{1},D_{2}})} \in \Omega}{{P\left( {{{\left. {M\left( {h,d_{1},d_{2}} \right)} \middle| H_{2} \right.\&}\mspace{11mu} D_{1}},D_{2}} \right)}*{P\left( {D_{1},D_{2}} \right)}}}$

If, for the purpose of the calculation, one assumes that d₁ and d₂, aswell as p_(d1) and p_(d2) are independent variables, it can be shownthat:

${P\left( M \middle| H_{1} \right)} = {{\sum\limits_{\Omega}{{P\left( {{{\left. {M\left( {h,d_{1},d_{2}} \right)} \middle| H_{1} \right.\&}\mspace{11mu} D_{1}},D_{2}} \right)}*{P\left( {D_{1},D_{2}} \right)}}} = {\sum\limits_{S}{{P\left( {M(h)} \middle| H \right)}*{P\left( {M\left( d_{1} \right)} \middle| D_{1} \right)}*{P\left( {M\left( d_{2} \right)} \middle| D_{2} \right)}*{P\left( D_{1} \right)}*{P\left( D_{2} \right)}}}}$

Consider the first three terms under the last sum above: P(M(x)/X), forx in {h,d₁,d₂}.

The calculation of the probability of correct allele call (hitting the“true allele value”) is based on measurement of outcome x given the truevalue of allele X. If the measured value x and the true value X areequal, that probability is p_(x) (the probability of correctmeasurement). If x and X are different, that probability is (1-p_(x))/3.For example, calculate the probability that the “true value” C is foundunder the conditions that X=C, and the measured value is x=A. Theprobability of getting A is p_(x). The probability of getting C, T or Gis (1-p_(x)). So, the probability of hitting C is (1−p_(x))/3, since onecan assume that C, T and G are equally likely.

If the indicator variable I_(x) is included in the calculation, whereI_(x)=1 if x=X and I_(x)=0 if x≠X, the probabilities are as follows:

P(M(x)/X)=I _({x=X) }*p _(x)+(1−I _({x=X)})*(⅓)*(1-p _(x)),x in {h,d ₁,d ₂}

Now consider the last two terms in P(M|H₁). P(D₁) and P(D₂) arepopulation frequencies of alleles A,C,T and G, that may be known fromprior knowledge.

Consider the expression shown above for a particular state s₂, given theparticular measurement M(h=A,d₁=G,d₂=C):

P(M(h)|H) * P(M(d₁)|D₁) * P(M(d₂)|D₂) * P(D₁) * P(D₂) =  = P(M(h) = A|H = A) * P(M(d₁) = G|D₁ = A) * P(M(d₂) = C|D₂ = C) * P(D₁ = A) * P(D₂ = C) =  = p_(h) * ((1 − p_(d 1))/3) * p_(d 2) * f(D₁ = A) * f(D₂ = C)

Similarly, calculate (1) given the particular measurement (in this caseM(h=A,d₁=G,d₂=C)) for remaining 15 states and sum over the set Ω.

Now P(M/H₁) and P(M/H₂) have been calculated. Finally, calculate P(H₁/M)and P(H₁/M) as described before:

${P\left( H_{1} \middle| M \right)} = \frac{P\left( M \middle| H_{1} \right)}{{P\left( M \middle| H_{1} \right)} + {P\left( M \middle| H_{2} \right)}}$${P\left( H_{2} \middle| M \right)} = \frac{P\left( M \middle| H_{2} \right)}{{P\left( M \middle| H_{1} \right)} + {P\left( M \middle| H_{2} \right)}}$

Assigning the Allele Calls and Corresponding Probabilities

Now establish four different hypotheses:

H_(p2A): “true value” of p₂ is AH_(p2C): “true value” of p₂ is CH_(p2T): “true value” of p₂ is TH_(p2G): “true value” of p₂ is Gand calculate P(H_(p2A)/M), P(H_(p2C)/M), P(H_(p2T)/M), P(H_(p2G)/M).The highest value determines the particular allele call andcorresponding probability.

Since the origin of p₂ is unknown (it is derived from d₁ withprobability of P(H₂/M) and from d₂ with probability P(H₁/M)), one mustconsider both cases that p₂ allele originates from d₁ or d₂. ForHypothesis H_(A), applying Bayes rule, give:

P(H _(p2A) |M)=P(H _(p2A) |M,H)*P(H ₁ |M)+P(H _(p2A) |M,H ₂)*P(H ₂ |M)

P(H₁/M) and P(H₂/M) have already been determined in step 1. By Bayesrule:

${P\left( {\left. H_{p\; 2A} \middle| H_{1} \right.,M} \right)} = \frac{{P\left( {\left. M \middle| H_{1} \right.,H_{p\; 2A}} \right)}*{P\left( {H_{1},H_{p\; 2\; A}} \right)}}{P\left( {H_{1},M} \right)}$

Since H₁ implies that p₂ originates from d₂:

P(M|H ₁ ,H _(p2A))=P(M(d ₂)|D ₂ =A)=I _({d) ₂ _(=D) ₂ _(}) *p _(d2)+(1−I_({d) ₂ _(D) ₂ _(}))*(1−p _(d2)2)/3

P(H_(i),M/H_(p2A))=P(M(d₂)/D₂=A)=I_({d2=D2})*p_(d2)+(1−I_({d2=D2}))*(⅓)*(1−p_(d2)),as described before.P(H_(p2A))=P(D₂=A)=f_(d2)(A), where f_(d2)(A) is obtained frompopulation frequency data.P(H₁,M)=P(H₁,M/H_(p2A))*P(H_(p2A))+P(H₁,M/H_(p2C))*P(H_(p2C))+P(H₁,M/H_(p2T))*P(H_(p2T))+P(H₁,M/H_(p2 G))*P(H_(p2G)).Similarly, calculate P(H_(p2A)&H₂/M).P(H_(p2A)/M)=P(H_(p2A)&H₁/M)+P(H_(p2A)&H₂/M), therefore the probabilitythat p₂ is equal to A has been calculated. Repeat the calculation forC,T, and G. The highest value will give the answer of p₂ allele call andthe corresponding probability.

Assigning the Allele Call to p₁ (Allele Corresponding to the HaploidCell and One of the Diploid Cells)

As before, we establish four different hypotheses:

H_(p1A): “true value” of p₁ is AH_(p1C): “true value” of p₁ is CH_(p1T): “true value” of p₁ is TH_(p1G): “true value” of p₁ is Gand calculate P(H_(p1A)/M), P(H_(p1C)/M), P(H_(p1T)/M), P(H_(p1G)/M)

Here is an elaboration of H_(p1A). In the “true case” case, p₁ will beequal to A only if the haploid and the corresponding diploid cell areequal to A. Therefore, in order to calculate p₁ and p_(p1) one mustconsider situations where haploid and corresponding diploid cell areequal. So, the hypothesis H_(p1A): the “true value” of p₁ is A andbecomes H_(hdA): the “true value” of the haploid cell and correspondingdiploid cell is A.

Since the origin of h is unknown (it is derived from d₁ with probabilityof P(H₁/M) and from d₂ with probability P(H₂/M)), one must consider bothcases, that h allele originates from d₁ or d₂, and implement that indetermination of p₁. That means, using Bayes rule:

P(H _(hdA) |M)=P(H _(hdA) |M,H ₁)*P(H ₁ |M)+P(H _(hdA) |M,H ₂)*P(H ₂ |M)

As before, P(H₁/M) and P(H₂/M) are known from previous calculations.

$\mspace{20mu} {{P\left( {\left. H_{hdA} \middle| H_{1} \right.,M} \right)} = \frac{{P\left( {H_{1},\left. M \middle| H_{hdA} \right.} \right)}*{P\left( H_{hdA} \right)}}{P\left( {H_{1},M} \right)}}$P(H₁, M/H_(hdA)) = P(M(h)/H = A) * P(M(d₁)/D₁ = A) =  =   [I_({h = H}) * p_(h) + (1 − I_({h = H})) * (1/3) * (1 − p_(h))] *   [I_({ d 1 = D 1}) * p_(d 1) + (1 − I_({d 1 = D 1})) * (1/3) * (1 − p_(d 1))],

since H1 implies that p₁ originates from d₁.P(H_(hdA))=P(h=A)*P(D₁=A)=f_(h)(A)*f_(d1)(A), where f_(h)(A) andf_(d2)(A) are obtained from population frequency data.P(H₁,M)=P(H₁,M/H_(hdA))*P(H_(hdA))+P(H₁,M/H_(hdc))*P(H_(hdC))+P(H₁,M/H_(hdT))*P(H_(hdT))+P(H₁,M/H_(hdG))*P(H_(hdG))

Similarly, calculate P(H_(hdA)&H₂/M).

P(H_(hdA)/M)=P(H_(hdA)&H₁/M)+P(H_(hdA)&H₂/M) and now we have calculatedthe probability that p₁ is equal to A. Repeat the calculation for C,T,and G. The highest value will give the answer of p₁ allele call andcorresponding probability.

Example Input

Two input examples are shown. The first example is of a set of SNPs witha low tendency to cosegregate, that is, SNPs spread throughout achromosome, and the input data is shown in FIG. 21. The second exampleis of a set of SNPs with a high tendency to cosegregate, that is SNPsclustered on a chromosome, and the input data is shown in FIG. 22. Bothsets of data include an individual's measured SNP data, the individual'sparents SNP data, and the corresponding confidence values. Note thatthis data is actual data measured from actual people. Each row representmeasurements for one particular SNP location. The columns contain thedata denoted by the column header. The key to the abbreviations in thecolumn headers is as follows:

-   -   family_id=the unique id for each person (included for clerical        reasons)    -   snp_id=the SNP identification number    -   e1, e2=the SNP nucleotide values for the embryo    -   p1, p2=the SNP nucleotide values for the father    -   m1, m2=the SNP nucleotide values for the mother    -   pe1, pe2=the measurement accuracy for e1,e2    -   pp1, pp2=the measurement accuracy for p1,p2    -   pm1, pm2=the measurement accuracy for m1,m2

Example Output

The two examples of output data are shown in FIG. 23 and FIG. 24, andcorrespond to the output data from the data given in FIG. 21 and FIG. 22respectively. Both figures show an individual's measured SNP data, theindividual's parents SNP data, the most likely true value of theindividual's SNP data, and the corresponding confidences. Each rowrepresents the data corresponding to one particular SNP. The columnscontain the data denoted by the column header. The key to theabbreviations in the column headers is as follows:

-   -   snp_id=the SNP identification number    -   true_value=the proposed nucleotide value for e1,e2    -   true_hyp=the hypothesis for the origin of e1,e2    -   ee=the measured SNP nucleotide values for e1,e2    -   pp=the measured SNP nucleotide values for p1,p2    -   mm=the measured SNP nucleotide values for m1,m2    -   HypProb=the probability of the final hypothesis. There is only        one number for the output, but due to the excel column        structure, this number is replicated in all rows.

Note that this algorithm can be implemented manually, or by a computer.FIG. 21 and FIG. 22 show examples of input data for a computerimplemented version of the method. FIG. 23 shows the output data for theinput data shown in FIG. 21. FIG. 24 shows the output data for the inputdata shown in FIG. 22.

Simulation Algorithm

Below is a second simulation which was done to ensure the integrity ofthe system, and to assess the actual efficacy of the algorithm in awider variety of situations. In order to do this, 10,000 full systemsimulations were run. This involves randomly creating parental geneticdata, emulating meiosis in silico to generate embryonic data, simulatingincomplete measurement of the embryonic data, and then running themethod disclosed herein to clean the simulated measured embryonic data,and then comparing the simulated cleaned data with the simulated realdata. A more detailed explanation of the simulation is given below, andthe visual representation of the flow of events is given in FIG. 18. Twodifferent implementations of the theory were tested. A fullerexplanation is given below.

Simulation Algorithms for DH and PS and Results

For both algorithms, the initial input variables are:

-   -   (i) the list of SNPs to test,    -   (ii) the population frequency of the maternal (popfreqlistMM)        and paternal (popfreqlistPP) chromosomes,    -   (iii) the probabilities of a correct allele call for haploid        measurement (ph,pe), and for unordered diploid measurements        (pd).

These values should be fixed based on the results from empirical data(population frequency) on relevant SNPs, and from measuringinstrumentation performance (p_(h),p_(d),pe). The simulation was run forseveral scenarios such as most likely (informed), uniform (uninformed)and very unlikely (extreme case).

Once the above static parameters are fixed, crossover probabilitiesgiven the particular SNPs are the same for all the simulations, and willbe derived ahead of the time given the databases for sniplocation(SNIPLOC_NAME_MAT) and genetic distance (HAPLOC_NAME_MAT).

[crossprob,snips]=GetCrossProb(snips,SNIPLOC_NAME_MAT,parameters,HAPLOC_NAME_MAT);

Preliminary Simulation Loop

The preliminary simulation loop is to demonstrate that the genetic datathat will be used for the full simulation is realistic. Steps 1 through5 were repeated 10,000 times. Note that this simulation can be run foreither or both parents; the steps are identical. In this case, thesimulation will be run for the paternal case for the purposes ofillustration, and the references to FIG. 18 will also include thecorresponding maternal entry in FIG. 18 in parentheses. This simulationwas also run using Matlab.

Step 1: Generate Original Parental Diploid Cells (P1,P2),

[P1,P2]=GenerateOriginalChromosomes(snips,popfreqlistPP); 1801 (1802)

Generate original paternal cells depending on the population frequencyfor each SNP for father cells.

Step 2: Generate Haploid and Unordered Diploid Data for DHAlgo

Simulate crossover of the parental chromosomes 1803 to give two sets ofchromosomes, crossed over: P1C1, P2C1 and P1C2, P2C2; 1804 (1805). Pickone of the father alleles after the crossover 1806 (from the first set)for haploid allele HP 1807 (1808) in this case P1 (since there is nodifference which one), and mix up the order in the diploid alleles toget (D1P,D2P) 1807 (1808).

-   -   HP=PickOne(P1C1,P2C1);    -   [D1P,D2P]=Jumble(P1,P2).

Step 3: Introduce Error to the Original Dataset in Order to SimulateMeasurements

Based on given probabilities of correct measurement (ph—haploid,pd—diploid measurement), introduce error into the measurements to givethe simulated measured parental data 1811 (1812).

-   -   hp=MakeError(HP,ph);    -   d1p=MakeError(D1P,pd);    -   d2p=MakeError(D2P,pd).        Step 4: Apply DHAlgo to get (p1,p2), (pp1,pp2)

DHAlgo takes alleles from haploid cell and unordered alleles fromdiploid cell and returns the most likely ordered diploid alleles thatgave rise to these. DHAlgo attempts to rebuild (P1,P2), also returnsestimation error for father (pp1,pp2). For comparison, the empiricalalgorithm that does simple allele matching is also used. The goal is tocompare how much better is the disclosed algorithm, compared to thesimple empirical algorithm.

[p1,p2,pp1,pp2]=DHAlgo(hp,dlp,d2p,ph,pd,snips,popfreqlistPP,‘DH’);[p1s,p2s,pp1s,pp2s]=_(DH)Algo(hp,d1p,d2p,ph,pd,snips,popfreqlistPP,‘ST’);

Step 5: Collect Statistics for the Run

Compare (P1,P2) to derived (p₁,p₂).

[P1cmp(:,i),P2cmp(:,i),P1prob(:,i),P2prob(:,i),P1mn(i),P2mn(i)]=DHSimValidate(P1,P2,p1,p₂ ,pp1,pp2);

Note: (P1S_(i),P2S_(i),P1P_(i),P2P_(i),P1A_(i),P2A_(i))=(I_({P1=p1)},I_({P2=p2}), p_(p1),p_(p2),p1 _(acc), p2_(acc)), where I_({P1=p1}) isbinary indicator array for estimation of DH algorithm accuracy for allthe SNPs, similarly, for I_({P2=p2})·p_(p1),p_(p2) are probabilities ofa correct allele call derived from the algorithm, andp1_(acc)=mean(I_({P1=p1})), i.e. average accuracy for this run for p1,similar for p2_(acc).

Preliminary Simulation Results

Ten thousand simulations were used to estimate the algorithm accuracyDHAccuracy.P1=mean(P1A_(i)), DHAccuracy.P2=mean(P2A_(i)), which showsthe overall accuracy of the DH algorithm from P1,P2. On an individualSNP basis, the average accuracy on each SNP SNPAcc.P1=mean(P1S_(i))should agree with the average of the estimated probability of correctlymeasuring that SNP, SNPProb.P1=mean(P2P_(i)), i.e. if the algorithmworks correctly, the value for SNPAcc.Plshould correspond closely toSNPProb.P1. The relationship between these two is reflected by theircorrelation.

The 10000 loops of the simulation were run for different setupscenarios:

-   (1) The underlying population frequency was given by existing    genotyping data which is more realistic, and uniform population    frequencies where A,C,T,G have the same probability on each SNP.-   (2) Several combinations for measurement accuracy for haploid and    unordered diploid measurements (ph,pd). Varying assumptions were    made; that the measurements are both very accurate (0.95,0.95), less    accurate (0.75,0.75) and inaccurate or random (0.25,0.25), as well    as unbalanced combinations of (0.9, 0.5), (0.5, 0.9). What might be    closest to reality may be accuracies of approximately 0.6 to 0.8.-   (3) The simulation was run in all these cases for both the    DHAlgorithm and simple matching STAlgorithm, in order to assess the    performance of the disclosed algorithm.    The results of all these runs are summarized in FIG. 25.

The disclosed algorithm performs better than the existing empiricalalgorithm in these simulations, especially for the realistic cases ofnon-uniform population frequency, and unbalanced or reducedprobabilities of correct measurements. It has also been confirmed thatour estimates of the algorithm accuracy for individual SNPs are verygood in these cases, since the correlation between the estimatedaccuracy of correct allele call and simulation average accuracy isaround 99%, with average ratio of 1.

In the most realistic case, for data population frequency and(ph,pd)=(0.6, 0.8), the average percent of correctly retrieved SNPs for(P1,P2) is (0.852, 0.816) in implementation 1, and (0.601, 0.673) inimplementation 2.

Note that for FIG. 25 and FIG. 26 the rows beginning with “data” usepopulation frequency data was taken from empirical results, while therows beginning with “uniform” assume uniform populations.

It is important to note that in FIG. 25 and FIG. 26 the accuracy isdefined as the average percent of SNPs where the correct SNP call wasmade and the correct chromosome of origin was identified. It is alsoimportant to note that these simulations reflect two possibleimplementations of the algorithm. There may be other ways to implementthe algorithm that may give better results. This simulation is onlymeant to demonstrate that the method can be reduced to practice.

Full Simulation Loop

Steps 1-8 were repeated 10000 times. This is the simulation to test thefull disclosed method to clean measured genetic data for a targetindividual using genetic data measured from related individuals, in thiscase, the parents. This simulation was run using Matlab.

Step 1: Generate Original Parental Diploid Cells (P1,P2),(M1,M2)

[P1,P2]=GenerateOriginalChromosomes(snip s,popfreqlistPP); (1801)

[M1,M2]=GenerateOriginalChromosomes(snips,popfreqlistMM); (1802)

Generate original parental cells depending on the population frequencyfor each SNP for mother and father cells.

Step 2: Crossover Parental Cells (PIC,P2C), (M1C,M2C) (1803)

Generate two sets of paternal cells with crossovers: first to get(P1C1,P2C1) used in DHAlgo, and second time to get (P1C2,P2C2) used inPSAlgo. (1804)

Generate two sets of maternal cells with crossovers: first to get(M1C1,M2C1) used in DHAlgo, and (M1C2,M2C2) used in PSAlgo. (1805)

-   -   [P 1C1,P2C1]=Cross(P 1,P2,snips,fullprob);    -   [P1C2,P2C2]=Cross(P 1,P2,snips,fullprob);    -   [M1C1,M2C1]=Cross(M 1,M2,snips,fullprob);    -   [M1C2,M2C2]=Cross(M1,M2,snips,fullprob);

Step 3 Make Haploid Cell and Unordered Diploid Cells for DHAlgo (1806)

Pick one of the sets of paternal cells (1804, first set) for haploidcell HP, and mix up the order in the diploid cell to get (D1P,D2P)(1807). Do the same for mother cells (1805, first set) to get MH,(D1M,D2M). (1808).

-   -   HP=PickOne(P1C1,P2C1);    -   HM=PickOne(M1C1,M2C1);    -   [D1P,D2P]=Jumble(P1,P2);    -   [D1M,D2M]=Jumble(M1,M2);

Step 4: Make Diploid Embryo Cell (1809)

Pick one of the paternal cells (1804, second set) and one of thematernal cells (1805, second set) for embryo cell. Mix up the order formeasurement purposes.

-   -   E1=PickOne(P1C2,P2C2);    -   E2=PickOne(M1C2,M2C2);    -   [E1J,E2J]=Jumble(E1,E2); (1810)        Step 5: Introduce error to the measurements (1811, 1812, 1813)

Based on given measurement error (p_(h)—haploid cells, p_(d)—unordereddiploid cells, pe—embryo cells), introduce error into the measurements.

-   -   hp=MakeError(HP,ph); (1811)    -   d1p=MakeError(D1P,pd); (1811)    -   d2p=MakeError(D2P,pd); (1811)    -   hm=MakeError(HM,ph); (1812)    -   dim=MakeError(D1M,pd); (1812)    -   d2m=MakeError(D2M,pd); (1812)    -   e1=MakeError(E1J,pe1); (1813)    -   e2=MakeError(E2J,pe2); (1813)        Step 6: Apply DHAlgo to get (p1,p2),(m1,m2), (pp1,pp2),(pm1,pm2)

DHAlgo takes a haploid cell and an unordered diplod cell and returns themost likely ordered diploid cell that gave rise to these. DHAlgoattempts to rebuild (P1C1,P2C1) for father and (M1C1,M2C1) for motherchromosomes, also returns estimation error for father (pp1,pp2) andmother (pm1,pm2) cells.

-   -   [p1,p₂,p_(p)1,p_(p)2]=DHAlgo(hp,d1p,d2p,snips,popfreqlistPP);        (1814)    -   [m1,m2,pm1,pm2]=DHAlgo(hm,d1m,d2m,snips,popfreqlistMM); (1815)

Step 7: Apply PSAlgo to get (DE1,DE2) (1816)

PSAlgo takes rebuilt parent cells (p1,p2,m1,m2) and unordered measuredembryo cell (e1,e2) to return most likely ordered true embryo cell(DE1,DE2). PS Algo attempts to rebuild (E1,E2).

[DE1,DE2,alldata]=PSAlgo(snips,e1,e2,p1,p2,m1,m2,pe,pp1,pp2,pm1,pm2,parameters,crossprob,popfreqlistPP,popreqlistMM);

Step 8: Collect Desired Statistics from this Simulation Run

Get statistics for the run:

simdata=SimValidate(alldata,DE1,DE2,P1P2,M1,M2,E1,E2,p1,p2,m1,m2,e1,e2,pe,pe,pp1,pp2,pm1,pm2);

Simulation Results

Ten thousand simulations were run and the final estimates for thealgorithm accuracy PSAccuracy.E1=mean(E1A_(i)),PSAccuracy.E2=mean(E2A_(i)), which tells us the overall accuracy of thePS algorithm from E1,E2 were calculated. On an individual SNP basis, theaverage accuracy on each SNP SNPAcc.E1=mean(E1S_(i)) should agree withthe average of the estimated probability of correctly measuring thatSNP, SNPProb.E1=mean(E2P_(i)), i.e. if the algorithm is writtencorrectly, then SNPAcc.E1 should be observed to correlate to SNPProb.E1.The relationship between these two is reflected by their correlation.

Ten thousand loops of the simulation have been run for different setupscenarios:

-   (1) Underlying population frequency given by existing genotyping    data which is more realistic, and uniform population frequencies    where A,C,T,G have the same probability on each SNP.-   (2) Several combinations of measurement accuracy for haploid,    unordered diploid and embryo measurements (ph,pd,pe). A variety of    accuracies were simulated: very accurate (0.95,0.95,0.95), less    accurate (0.75,0.75,0.75) and inaccurate or random (0.25,0.25,0.25),    as well as unbalanced combinations of (0.9, 0.5,0.5), (0.5,    0.9,0.9). What may be closest to reality is approximately    (0.6,0.8,0.8).-   (3) We ran the simulation in all these cases for both our    PSAlgorithm and simple matching STPSAlgorithm, in order to assess    the performance of the disclosed algorithm.    The results of these runs are summarized in the FIG. 26.

The disclosed algorithm performs better than the existing empiricalalgorithm in these simulations, especially for the realistic cases ofnon-uniform population frequency, and unbalanced or reducedprobabilities of correct measurements. It has also been shown that theestimates of the algorithm accuracy for individual SNPs are very good inthese cases, since the correlation between the estimated accuracy ofcorrect allele call and simulation average accuracy is around 99%, withaverage ratio of 1.

In the most realistic case, for data population frequency and(ph,pd,pe)=(0.6, 0.8, 0.8), the average percent of correctly retrievedSNPs for (E1,E2) is (0.777, 0.788) in implementation 1 and (0.835,0.828) in implementation 2. As mentioned above, the number denoting theaverage accuracy of algorithm refers not only to the correct SNP call,but also the identification of correct parental origin of the SNP. To beeffective, an algorithm must return better results than the algorithmthat simply accepts the data as it is measured. One might be surprisedto see that in some cases, the accuracy of the algorithm is lower thanthe listed accuracy of measurement. It is important to remember that forthe purposes of this simulation a SNP call is considered accurate onlyif it is both called correctly, and also its parent and chromosome oforigin are correctly identified. The chance of getting this correct bychance is considerably lower than the measurement accuracy.

Laboratory Techniques

There are many techniques available allowing the isolation of cells andDNA fragments for genotyping. The system and method described here canbe applied to any of these techniques, specifically those involving theisolation of fetal cells or DNA fragments from maternal blood, orblastocysts from embryos in the context of IVF. It can be equallyapplied to genomic data in silico, i.e. not directly measured fromgenetic material.

In one embodiment of the system, this data can be acquired as describedbelow.

Isolation of Cells

Adult diploid cells can be obtained from bulk tissue or blood samples.Adult diploid single cells can be obtained from whole blood samplesusing FACS, or fluorescence activated cell sorting. Adult haploid singlesperm cells can also be isolated from sperm sample using FACS. Adulthaploid single egg cells can be isolated in the context of eggharvesting during IVF procedures.

Isolation of the target single blastocysts from human embryos can bedone following techniques common in in vitro fertilization clinics.Isolation of target fetal cells in maternal blood can be accomplishedusing monoclonal antibodies, or other techniques such as FACS or densitygradient centrifugation.

DNA extraction also might entail non-standard methods for thisapplication. Literature reports comparing various methods for DNAextraction have found that in some cases novel protocols, such as theusing the addition of N-lauroylsarcosine, were found to be moreefficient and produce the fewest false positives.

Amplification of Genomic DNA

Amplification of the genome can be accomplished by multiple methodsincluding: ligation-mediated PCR (LM-PCR), degenerate oligonucleotideprimer PCR (DOP-PCR), and multiple displacement amplification (MDA). Ofthe three methods, DOP-PCR reliably produces large quantities of DNAfrom small quantities of DNA, including single copies of chromosomes;this method may be most appropriate for genotyping the parental diploiddata, where data fidelity is critical. MDA is the fastest method,producing hundred-fold amplification of DNA in a few hours; this methodmay be most appropriate for genotyping embryonic cells, or in othersituations where time is of the essence.

Background amplification is a problem for each of these methods, sinceeach method would potentially amplify contaminating DNA. Very tinyquantities of contamination can irreversibly poison the assay and givefalse data. Therefore, it is critical to use clean laboratoryconditions, wherein pre- and post-amplification workflows arecompletely, physically separated. Clean, contamination free workflowsfor DNA amplification are now routine in industrial molecular biology,and simply require careful attention to detail.

Genotyping Assay and Hybridization

The genotyping of the amplified DNA can be done by many methodsincluding MOLECULAR INVERSION PROBES (MIPs) such as AFFMETRIX's GENFLEXTag Array, microarrays such as AFFMETRIX's 500K array or the ILLUMINABead Arrays, or SNP genotyping assays such as APPLIED BIOSYSTEMS'sTAQMAN assay. The AFFMETRIX 500K array, MIPs/GENFLEX, TAQMAN andILLUMINA assay all require microgram quantities of DNA, so genotyping asingle cell with either workflow would require some kind ofamplification. Each of these techniques has various tradeoffs in termsof cost, quality of data, quantitative vs. qualitative data,customizability, time to complete the assay and the number of measurableSNPs, among others. An advantage of the 500K and ILLUMINA arrays are thelarge number of SNPs on which it can gather data, roughly 250,000, asopposed to MIPs which can detect on the order of 10,000 SNPs, and theTAQMAN assay which can detect even fewer. An advantage of the MIPs,TAQMAN and ILLUMINA assay over the 500K arrays is that they areinherently customizable, allowing the user to choose SNPs, whereas the500K arrays do not permit such customization.

In the context of pre-implantation diagnosis during IVF, the inherenttime limitations are significant; in this case it may be advantageous tosacrifice data quality for turn-around time. Although it has other clearadvantages, the standard MIPs assay protocol is a relativelytime-intensive process that typically takes 2.5 to three days tocomplete. In MIPs, annealing of probes to target DNA andpost-amplification hybridization are particularly time-intensive, andany deviation from these times results in degradation in data quality.Probes anneal overnight (12-16 hours) to DNA sample. Post-amplificationhybridization anneals to the arrays overnight (12-16 hours). A number ofother steps before and after both annealing and amplification bring thetotal standard timeline of the protocol to 2.5 days. Optimization of theMIPs assay for speed could potentially reduce the process to fewer than36 hours. Both the 500K arrays and the ILLUMINA assays have a fasterturnaround: approximately 1.5 to two days to generate highly reliabledata in the standard protocol. Both of these methods are optimizable,and it is estimated that the turn-around time for the genotyping assayfor the 500k array and/or the ILLUMINA assay could be reduced to lessthan 24 hours. Even faster is the TAQMAN assay which can be run in threehours. For all of these methods, the reduction in assay time will resultin a reduction in data quality, however that is exactly what thedisclosed invention is designed to address. Some available techniquesthat are faster are not particularly high-throughput, and therefore arenot feasible for highly parallel prenatal genetic diagnosis at thistime.

Naturally, in situations where the timing is critical, such asgenotyping a blastocyst during IVF, the faster assays have a clearadvantage over the slower assays, whereas in cases that do not have suchtime pressure, such as when genotyping the parental DNA before IVF hasbeen initiated, other factors will predominate in choosing theappropriate method. For example, another tradeoff that exists from onetechnique to another is one of price versus data quality. It may makesense to use more expensive techniques that give high quality data formeasurements that are more important, and less expensive techniques thatgive lower quality data for measurements where the fidelity is notcritical. Any techniques which are developed to the point of allowingsufficiently rapid high-throughput genotyping could be used to genotypegenetic material for use with this method.

Combinations of the Aspects of the Invention

As noted previously, given the benefit of this disclosure, there aremore aspects and embodiments that may implement one or more of thesystems, methods, and features, disclosed herein. Below is a short listof examples illustrating situations in which the various aspects of thedisclosed invention can be combined in a plurality of ways. It isimportant to note that this list is not meant to be comprehensive; manyother combinations of the aspects, methods, features and embodiments ofthis invention are possible.

One example is the system which may operate in an IVF laboratory (seeFIG. 19) that would allow full genotyping of all viable embryos withinthe time constraints of the IVF procedure. This would require aturn-around time from egg fertilization to embryo implantation of underthree days. This system may consist of parental genetic samples 1901from IVF user (mother) 1902 and IVF user (father) 1903 being analyzed atIVF lab 1904 using a genotyping system. It may involve multiple eggsthat are harvested from the mother 1902 and fertilized with sperm fromthe father 1903 to create multiple fertilized embryos 1905. It mayinvolve a laboratory technician extracting a blastocyst for each embryo,amplifying the DNA of each blastocyst, and analyzing them using a highthroughput genotyping system 1906. It may involve sending the geneticdata from the parents and from the blastocyst to a secure dataprocessing system 1907 which validates and cleans the embryonic geneticdata. It may involve the cleaned embryonic data 1908 being operated onby a phenotyping algorithm 1909 to predict phenotype susceptibilities ofeach embryo. It may involve these predictions, along with relevantconfidence levels, being sent to the physician 1910 who helps the IVFusers 1902 and 1903 to select embryos for implantation in the mother1901.

Another example could utilize a variety of genotyping measurementtechniques in a way that would optimize the value of each. For example,a lab could use a technique that is expensive but can give high qualitydata in cases with low signal, such as APPLIED BIOSYSTEMS's TAQMANassay, to measure the target DNA, and use a technique that is lessexpensive but requires a greater amount of genetic material to give goodquality data, such as AFFMETRIX's 500K Genechip, or MIPs, to measure theparental DNA.

Another example could be a situation in which a couple undergoing IVFtreatment have eggs harvested from the woman, and fertilized with spermfrom the man, producing eight viable embryos. A blastocyst is harvestedfrom each embryo, and the genomic data from the blastocysts are measuredusing TAQMAN Genotyping Assay. Meanwhile, the diploid data is measuredfrom tissue taken from both parents using MOLECULAR INVERSION PROBES.Haploid data from one of the man's sperm, and one of the woman's eggs isalso measured using MIPs. The genetic data of the parents is used toclean the SNP data of the eight blastocysts. The cleaned genetic data isthen used to allow predictions to be made concerning the potentialphenotypes of the embryos. Two embryos are selected which have the mostpromising profile, and allowed to implant in the woman's uterus.

Another example could be a situation where a pregnant woman whosehusband has a family history of Tay-Sachs disease wants to know if thefetus she is carrying is genetically susceptible, but she does not wantto undergo amniocentesis, as it carries a significant risk ofmiscarriage. She has her blood drawn, some fetal DNA is isolated fromher blood, and that DNA is analyzed using MIPs. She and her husband hadalready had their full genomic data analyzed previously and it isavailable in silico. The doctor is able to use the in silico knowledgeof the parental genomes and the method disclosed herein to clean thefetal DNA data, and check if the critical gene that is responsible forTay-Sachs disease is present in the genome of the fetus.

Another example could be a situation where a 44-year old pregnant womanis concerned that the fetus she is carrying may have Downs Syndrome. Sheis wary of having an intrusive technique used for pre-natal diagnosis,given a personal history of miscarriages, so she chooses to have herblood analyzed. The health care practitioner is able to find fetal cellsin the maternal blood sample, and using the method disclosed herein,together with the knowledge of the woman's own genetic data, is able todiagnose for aneuploidy.

Another example could be a situation where a couple are undergoing IVFtreatment; they have eggs harvested from the woman, and fertilized withsperm from the man, producing nine viable embryos. A blastocyst isharvested from each embryo, and the genomic data from the blastocystsare measured using an ILLUMINA BEAD ARRAY. Meanwhile, the diploid datais measured from tissue taken from both parents using MOLECULARINVERSION PROBES. Haploid data from the father's sperm is measured usingthe same method. There were no extra eggs available from the mother, sobulk diploid tissue samples are taken from her own father and mother,and a sperm sample from her father. They are all analyzed using MIPs andthe method disclosed herein is used to provide a genetic analysis forthe mother's genome. That data is then used, along with the father'sdiploid and haploid data, to allow a highly accurate analysis of thegenetic data of each of the blastocysts. Based on the phenotypicpredictions, the couple chooses three embryos to implant.

Another example could be a situation where a racehorse breeder wants toincrease the likelihood that the foals sired by his champion racehorsebecome champions themselves. He arranges for the desired mare to beimpregnated by IVF, and uses genetic data from the stallion and the mareto clean the genetic data measured from the viable embryos. The cleanedembryonic genetic data allows the breeder to find relevantgenotypic-phenotypic correlations and select the embryos forimplantation that are most likely to produce a desirable racehorse.

Another example could be a situation where a pregnant woman wants toknow whether the fetus she is carrying is predisposed towards anyserious illness. The father has since passed away, and so the haploidand diploid data generated from the father's brother and the father'sfather are used to help clean the genetic data of the fetus, measuredfrom fetal cells gathered during fetal blood sampling. A companycontracted by the health care practitioner uses the cleaned fetalgenetic data to provide a list of phenotypes that the fetus is likely toexhibit, along with the confidence of each prediction.

Another example could be an amniocentesis lab that must occasionallycontend with contaminated fetal genetic data due to poor laboratorytechniques. The disclosed method could be used to clean the contaminatedfetal genetic data using maternal and paternal genetic data. One couldimagine a situation where a laboratory is able to cut costs by relaxingsterility procedures, knowing that the disclosed method would be able tocompensate for an increased rate of contaminating DNA.

Another example could be a situation in which a woman in her forties isundergoing IVF to get pregnant. She wants to screen the embryos toselect the one(s) that are least likely to have a genetic illness, andare most likely to implant and carry to term. The IVF clinic she isusing harvests a blastocyst from each of the viable embryos, and usesstandard procedures to amplify the DNA, and measure key SNPs. Thetechnician then uses the methods disclosed herein to screen forchromosomal imbalances, and also to find and clean the genetic data ofthe embryos to make predictions about the phenotypic predispositions ofeach embryo.

Another example could be a situation where a pregnant woman hasamniocentesis, and the genetic material in the fetal cells in the bloodsample are used, along with the methods described herein to screen foraneuploidy and other chromosomal abnormalities.

Miscellaneous Notes

It is important to note that the method described herein concerns thecleaning of genetic data, and as all living creatures contain geneticdata, the methods are equally applicable to any human, animal, or plantthat inherits chromosomes from parents. The list of animals and plantscould include, but is not limited to: gorillas, chimpanzees, bonobos,cats, dogs, pandas, horses, cows, sheep, goats, pigs, cheetahs, tigers,lions, salmon, sharks, whales, camels, bison, manatees, elk, swordfish,dolphins, armadillos, wasps, cockroaches, worms, condors, eagles,sparrows, butterflies, sequoia, corn, wheat, rice, petunias, cow'svetch, sun flowers, ragweed, oak trees, chestnut trees, and head lice.

The measurement of genetic data is not a perfect process, especiallywhen the sample of genetic material is small. The measurements oftencontain incorrect measurements, unclear measurements, spuriousmeasurements, and missing measurements. The purpose of the methoddescribed herein is to detect and correct some or all of these errors.Using this method can improve the confidence with which the genetic datais known to a great extent. For example, using current techniques,uncleaned measured genetic data from DNA amplified from a single cellmay contain between 20% and 50% unmeasured regions, or allele dropouts.In some cases the genetic data could contain between 1% and 99%unmeasured regions, or allele dropouts. In addition, the confidence of agiven measured SNP is subject to errors as well.

In a case where the uncleaned data has an allele dropout rate ofapproximately 50%, it is expected that after applying the methoddisclosed herein the cleaned data will have correct allele calls in atleast 90% of the cases, and under ideal circumstances, this could riseto 99% or even higher. In a case where the uncleaned data has an alleledropout rate of approximately 80%, it is expected that after applyingthe method disclosed herein the cleaned data will have correct allelecalls in at least 95% of the cases, and under ideal circumstances, thiscould rise to 99.9% or even higher. In a case where the uncleaned datahas an allele dropout rate of approximately 90%, it is expected thatafter applying the method disclosed herein the cleaned data will havecorrect allele calls in at least 99% of the cases, and under idealcircumstances, this could rise to 99.99% or even higher. In cases wherea particular SNP measurement is made with a confidence rate close to90%, the cleaned data is expected to have SNP calls with confidence rateof over 95%, and in ideal cases, over 99%, or even higher. In caseswhere a particular SNP measurement is made with a confidence rate closeto 99%, the cleaned data is expected to have SNP calls with confidencerate of over 99.9%, and in ideal cases, over 99.99%, or even higher.

It is also important to note that the embryonic genetic data that can begenerated by measuring the amplified DNA from one blastomere can be usedfor multiple purposes. For example, it can be used for detectinganeuploides, uniparental disomy, sexing the individual, as well as formaking a plurality of phenotypic predictions. Currently, in IVFlaboratories, due to the techniques used, it is often the case that oneblastomere can only provide enough genetic material to test for onedisorder, such as aneuploidy, or a particular monogenic disease. Sincethe method disclosed herein has the common first step of measuring alarge set of SNPs from a blastomere, regardless of the type ofprediction to be made, a physician or parent is not forced to choose alimited number of disorders for which to screen. Instead, the optionexists to screen for as many genes and/or phenotypes as the state ofmedical knowledge will allow. With the disclosed method, the onlyadvantage to identifying particular conditions to screen for prior togenotyping the blastomere is that if it is decided that certain PSNPsare especially relevant, then a more appropriate set of NSNPs which aremore likely to cosegregate with the PSNPs of interest, can be selected,thus increasing the confidence of the allele calls of interest. Notethat even in the case where SNPs are not personalized ahead of time, theconfidences are expected to be more than adequate for the variouspurposes described herein.

Definitions

-   SNP (Single Nucleotide Polymorphism): A specific locus on a    chromosome that tends to show inter-individual variation.-   To call a SNP: to interrogate the identity of a particular base    pair, taking into account the direct and indirect evidence.-   To call an allele: to call a SNP.-   To clean genetic data: to take imperfect genetic data and correct    some or all of the errors, using genetic data of related individuals    and the method describe herein.-   Imperfect genetic data: genetic data with any of the following:    allele dropouts, unclear base pair measurements, incorrect base pair    measurements, spurious signals, or missing measurements.-   Confidence: the statistical likelihood that the called SNP, allele,    or set of alleles correctly represents the real genetic state of the    individual.-   Multigenic: affected by multiple genes, or alleles.-   Noisy genetic data: incomplete genetic data, also called incomplete    genetic data;-   Uncleaned genetic data: genetic data as measured, that is, with no    method has been used to correct for the presence of noise in the raw    genetic data; also called crude genetic data.-   Direct relation: mother, father, son, or daughter.-   Chromosomal Region: a segment of a chromosome, or a full chromosome.-   Parental Support: a name sometimes used for the disclosed method of    cleaning genetic data.-   Section of a chromosome: a section of a chromosome that can range in    size from one base pair to the entire chromosome.-   Section: a section of a chromosome.

1. A method for determining genetic data for DNA from cancer cells, themethod comprising: isolating cell-free DNA from a biological sample andamplifying a plurality of target loci from the isolated cell-free DNA toobtain amplification products; sequencing the amplification products bysequencing-by-synthesis to obtain genetic data for the plurality oftarget loci; and determining the most likely genetic data for DNA fromcancer cells based on the genetic data.
 2. The method of claim 1,wherein the biological sample is a blood sample.
 3. The method of claim1, wherein the amplification comprises targeted amplification to amplifythe target loci.
 4. The method of claim 1, wherein the amplificationfurther comprises universal amplification.
 5. The method of claim 1,wherein the sequencing-by-synthesis comprises clonal amplification andmeasurement of sequences of the clonally amplified DNA.
 6. The method ofclaim 1, wherein the target loci comprise SNP loci.
 7. The method ofclaim 6, wherein the confidence that each SNP is correctly called is atleast 95%.
 8. The method of claim 6, wherein the confidence that eachSNP is correctly called is at least 99%.
 9. The method of claim 1,wherein the genetic data obtained is noisy and comprises allele drop outerrors.
 10. The method of claim 1, wherein the genetic data obtained isnoisy and comprises measurement bias.
 11. The method of claim 1, whereinthe genetic data obtained is noisy and comprises incorrect measurements.12. The method of claim 1, further comprising normalizing the geneticdata for differences in amplification and/or measurement efficiencybetween the loci.